Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
+ It's machine dependent (we already knew that).
+ Read-only (e.g., code) pages may be shared.
+ Some machines have separate I and D spaces. It's not inconceivable
that different machines in the same family will have different
memory models.
+ Many machines use separate instruction and data caches. Instruction
caches can be made much simpler (= smaller = faster) by not
requiring them to pay attention to the data bus. Writing to a
memory location is not guaranteed to update the I cache, thus you
may or may not execute the modified instruction. Next week's model
may have a large enough cache that this week's code breaks.
Good enough?
;-D on ( Compiler hints won't save you here ) Pardo
Tsk tsk -- out of context, like the movie ad that quotes a reviewer who said:
"This movie is a great waste of time -- it's insipid, boring, and stupid. I've
never seen anything like it."
The ad quoted: "...great...I've never seen anything like it."
I said the practice "...was formally declared horrid ...", NOT that *I* thought
it was. I don't think that.
> Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
> + It's machine dependent (we already knew that).
Assembly code does tend to be machine dependent, all right -- but C was
designed to give the programmer the same facility, but with better control and
better discipline. It works quite well for those things it covers. It is, in
some reasonable sense, portable. A good compiler could generate self-modifying
code to suit, I'm sure. What we lack is the formal discipline for using it
wisely and well.
> + Read-only (e.g., code) pages may be shared.
Only in a time-sharing system. Most real-time control programs can't tolerate
such an environment for lots of reasons, mostly due to critical timing
requirements. Anyway, if generated code is treated as "executable data" this
whole point becomes irrelevant.
> + Some machines have separate I and D spaces. It's not inconceivable
> that different machines in the same family will have different
> memory models.
Not a problem. Generated executable code changes, just like any other variable
values, and can be treated the same way.
> + Many machines use separate instruction and data caches. Instruction
> caches can be made much simpler (= smaller = faster) by not
> requiring them to pay attention to the data bus. Writing to a
> memory location is not guaranteed to update the I cache, thus you
> may or may not execute the modified instruction. Next week's model
> may have a large enough cache that this week's code breaks.
>
Again, not a problem. You don't take the generated code out of the I cache,
but out of the D cache, since it can (by definition) change.
> Good enough?
Sorry, no. I've heard LOTS of arguments against programs that generate their
own code, and all of them -- so far as I am aware -- assume the "proper" answer
in generating arguments to show the answer is proper. Be honest: suppose you
*had* to do it this way, but needed a formal discipline to keep it clean and
understandable, and *fast.* Couldn't you find a reasonable way?
"It's unclean because ... well, because it's so ... so ... UNCLEAN!"
--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nat...@astro.as.utexas.edu
There are time-sharing systems that can handle critical tasks (e.g.
real-time applications with strict timing requirements) well. These systems
have the possibility to give tasks priorities. A critical task can be given
a high priority to enable it to meet its time limits. Such a system may very
well have code that is shared.
Another related issue is that self-modifying code cannot be executed
directly from ROM. Executing programs in ROM is an important memory-saving
technique in small, diskless, special-purpose systems.
>> Good enough?
>
>Sorry, no. I've heard LOTS of arguments against programs that generate their
>own code, and all of them -- so far as I am aware -- assume the "proper"
>answer in generating arguments to show the answer is proper. Be honest:
>suppose you *had* to do it this way, but needed a formal discipline to keep
>it clean and understandable, and *fast.* Couldn't you find a reasonable way?
>
>"It's unclean because ... well, because it's so ... so ... UNCLEAN!"
I guess there are situations with extreme performance requirements where
self-modifying code is justified. It would be interesting to see what a
semantics for self-modifying programs would look like. Regarding the
"uncleanliness" I think there are two main sources for this. The first is
the difficulty for a human mind to trace what's going on in a self-modifying
program. The second is an advantage read-only data (e.g. non-self-modifying
code) has over writable data: read-only data can be accessed asynchronously,
in any order, even concurrently, without any constraints. Therefore
read-only is "clean" in a concurrent environment. An example apart from code
is data bases, where read-only data don't have to be locked during
transactions.
Bjorn Lisper
> Sorry, no. I've heard LOTS of arguments against programs that
> generate their own code, and all of them -- so far as I am aware
> -- assume the "proper" answer in generating arguments to show the
> answer is proper. Be honest: suppose you *had* to do it this way,
> but needed a formal discipline to keep it clean and understandable,
> and *fast.* Couldn't you find a reasonable way?
Let me rephrase my position. There is nothing architecturally weird
about programs that generate their own code. Doing so does not cause
any problems on any machines that I am aware of, although few
OPERATING SYSTEMS support this.
There is a *major* problem with modifying existing code. Consider a
machine with separate I-cache and D-cache. If you execute an
instruction at location 0x500, then one at 0x504 that branches back
to the one at 0x500, and the one at 0x500 modifies itself, either:
+ The cache can be built so that it pays attention to the changing
data (code) in the I-cache. This requires a "snoopy" cache, which
is more complicated, less dense, and slower. If you are
constrained by price or die space, you may be forced to use a
slower, smaller (worse hit-ratio) cache.
+ You can execute out of the D-cache. This is effectively the
same as having a snoopy I-cache. If you have BOTH an I-cache and
a D-cache, you're left with coherency problems that are solved by
building a snoopy I-cache.
+ You can define this operation to be illegal, build a smaller,
faster, non-snoopy cache, and pay the price when you need to run
code that could be expressed "naturally" as self-modifying code.
Assume that you have an algorithm that is, say, 1000 instructions and
that each instruction takes 4 cycles to execute (must be a RISC :-).
One particular part of it, 10 instructions, is written as
self-modifying code. It can also be written as 100 instructions of
non-self-modifying (static) code. Further assume that the effect of
having a non-snoopy cache is to make the computer run 10% faster
(e.g., the cache is faster AND has a better hit rate). Then the
running time of the program on the machine with the snoopy I-cache
is:
T = 990 * 4.0 + 10 * 4.0 = 4000 cycles
On the machine without a snoopy I-cache (10% faster),
T = 990 * 3.6 + 100 * 3.6 = 3924 cycles
which is marginally FASTER for NOT using self-modifying code,
even though it takes an additional 90 instructions.
;-D on ( Cycles to burn reading flames ) Pardo
No, machines that do this don't work that way, i.e., you only execute
code from the I-cache or main memory, not from the D-cache (or the cache
currently acting as the D-cache, in cases where you can swap them).
There is nothing wrong with generating code on the fly.
What is wrong is assuming a model of machines that looks just like
a CPU + DRAM, or with caches that act only like coherent, joint I&D caches.
This is just one instance of a general problem: how do you handle
caches that are visible to the executable code in any way?
What's needed is a good set of cache-control primitives that:
a) Cover all of the cases
b) Become widely used enough for people to count on.
c) Are easy to make nops for environments where they don't matter.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR ma...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Actually, I always thought that ROM was much more important as a way of
eliminating the need to load the code and as a way of guaranteeing non-
self-modifying-ness. Usually ROM code is larger (thus memory-consuming)
than RAM-loaded code. This is because most RAM based systems assume some
form of self-modification of code, ranging from relocation tables that avoid
the need for self-relative addressing everywhere to file initialization code
that gets overwritten by its own buffer. On the other hand, some RAM based
systems require 4 Mb to run an editor, so you may be right in those cases
:->} !!
Seriously, newer architectures have reduced the difference (to 0 perhaps),
but the emphasis on RISC these days may resurrect self-modifying code --
a RISC-er technique is not known to mankind! (:-D}===<<).
> Bjorn Lisper
Charles Marslett
Now, if the routines that ensure the consistency of the code change themselves,
then the world will most likely fall down around one's head, but then things get
to be so recursive that my stack overflows ...
Hubert Matthews
I agree. That was the whole point of my original posting -- to see if someone
with a talent for languages could ignore the current prejudice against this
practice, and see what might be done. If there's really no way to do it in a
reasonable manner, fine -- but just *saying* it's awful doesn't *make* it awful.
> The first [source of uncleanliness] is
> the difficulty for a human mind to trace what's going on in a self-modifying
> program.
I think this arises because of the two different intellectual "levels" involved:
what the program is doing which generates the instructions (by analogy with the
instruction generator in a compiler) and what those instructions are going to
do when they are executed. These *must* be kept completely separate or massive
confusion results. But if the written code is approached with full awareness
of this requirement, the confusion vanishes -- in my experience, anyway.
> The second is an advantage read-only data (e.g. non-self-modifying
> code) has over writable data: read-only data can be accessed asynchronously,
> in any order, even concurrently, without any constraints. Therefore
> read-only is "clean" in a concurrent environment.
I believe self-modifying code was condemned long before concurrent environments
were even possible. This may be a disadvantage, but *no* technique exists
without trade-offs. I think we can live with this one.
> An example apart from code
> is data bases, where read-only data don't have to be locked during
> transactions.
OK, but I'm not sure this is relevant here. I do, however, appreciate your
keeping "data" plural. It happens so rarely nowadays ...
or you can recognize the situation and issue a write-back-Dcache-line
instruction, and a flush-Icache-line instruction, and everything works.
Note that a compile-and-go situation is exactly this; you are building
a program in the Dcache, and then you want to execute it. While it may
be unlikely that any remnants are left over after linking, et al., this
may be the only way to guarantee it.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
As I wrote, I was actually thinking primarily of diskless systems. In such a
system the program is stored in ROM. If the program is self-modifying, its
code has to be loaded into RAM before it can execute. If not it can be
executed directly in ROM and only the data areas need to be RAM. Thus memory
is saved (we don't need the extra RAM to hold the code).
Bjorn Lisper
1. Simple programs that need OPTIMAL speed, and would be outrageously
expensive on ANY machine. The major example is BITBLT, a subroutine
with about 10-15 parameters that does a massive amount of linear-time
work. If the BITBLT subroutine generates the machine code expressly
for the particular arguments, you can save a factor of 3 or more in
speed. This is very important for interactive graphics.
2. Peter Deutsch and ParcPlace Systems used a new technique to speed
up Smalltalk on conventional computers. When a subroutine is
executed, the types are checked and if possible, the subroutine is
compiled for speed & executed. Then this compiled code is cached for
possible use later, when the arguments have the same types. This also
gave a very large constant speedup to the Smalltalks marketed by
ParcPlace Systems.
I think papers on both these techniques appeared in the first OOPSLA
conference.
In light of these new techniques, I believe it's time to reexamine
our opinions about self-modifying code.
Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801
ARPA: gil...@cs.uiuc.edu UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies
For example, if treating code as data is the definition, then does passing a
procedure as a parameter in PASCAL, or a pointer to a function in C count?
Probably not.
OK, what about a jump table. Consider an array of pointers to functions in C.
Does changing the pointers count as SMC? Again, I don't think so.
So, changing a pointer by assigning to it is not SMC, but putting a new jump
instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
is SMC. Does one level of indirection really make that much difference?
Of course, if you want to be really horrid in C, you can try something like
this:
char codearray[] = { 0x03, /* one-byte instruction for something */
                     0xc9  /* one-byte return instruction */
                   };
and then 'call' this function using a cast to turn the pointer codearray into
a pointer to a function. (Who needs #asm anyway!) Then you can modify the
code as much as you want. This _is_ SMC without a doubt, because you can
overwrite code. So, I propose a very weak definition for SMC as code that
writes over other code.
As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
difference? Any thoughts on the subject?
Hubert Matthews
(I don't consider LISP or PROLOG programs that create code on the fly to be
SMC. Does anyone disagree?)
Two useful techniques. Both of them can be cast in the form of
"compile an entire procedure/function" (although I have seen BITBLTs
that compile code fragments and branch to them, I think those are
better understood as BITBLTs with streamlined function calling
sequences).
One of my favourite languages was POP-2, where the compiler was
a subroutine that could be used to compile code, returning what
was essentially a pointer to a function.
May I suggest that, perhaps, this is the moral: from the point of view
of programming principles, self-modifying code where the granule is
the procedure is not bad. Smaller granules are dangerous/hard to use.
(I've been running a pretty good streak of stupid mistakes in my
posts recently. Tell me all about them)
Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
ag...@gould.com - preferred, if you have MX records
ag...@xenurus.gould.com - if you don't
...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
The key problem with self-modifying code is that code created at runtime may
REPLACE code that appeared at "compile" time, hence, in general, no STATIC
understanding of the code is possible. This can make it hard for humans to
understand the code, but more importantly it implies that compilers cannot
perform optimizations, e.g., you cannot perform register allocation / value
propagation / common subexpression elimination / in-line expansion because
the code you remove might be code which is modified at runtime.
The fix need not be the elimination of the generate-and-execute programming
concept, rather, the language merely needs to specify constraints on the code
which is to be generated at runtime. For example, key constraints would be
that the runtime generated code could not redefine any existing function,
could only be invoked through a statically-specified interface, and could only
access data which either was statically-declared as accessible through that
interface or was created as new data local to the new code. In other words,
we can permit code to add to itself, but not to redefine what existed at
compile time: self-extending code.
An outline of fixes to be applied to Lisp (to create "Refined Lisp") appears
in my PhD thesis: H. Dietz, "The Refined-Language Approach to Compiling for
Parallel Supercomputers," June 1987. The folks working on Scheme had similar
ideas....
-hankd
Consider the following code fragment (version A):
for (i = 0; i<L; i++)
{ if (c1) p1();
if (c2) p2();
...
if (cN) pN();
}
The Boolean variables c1, c2, ... cN are assumed to be loop invariants.
Assuming the pi all take the same constant time and code space, the
time complexity of this code is O(N), and its space complexity is also
O(N). (I'm ignoring the factor of L for number of loop iterations.)
Now consider the following equivalent code, specialized for N=2, but
clearly generalizable to any N (version B):
if (c1)
{ if (c2) for (i = 0; i<L; i++) { p1(); p2(); }
else for (i = 0; i<L; i++) { p1(); }
}
else
{ if (c2) for (i = 0; i<L; i++) { p2(); }
else /* Nothing */ ;
}
Code like this will have a time complexity of O(M), where M is the
number of ci that are actually true, and a space complexity of O(2^N).
Let's say that M, the number of true ci, is typically logN. This gives
the following comparison for versions A and B:
version space time
A O(N) O(N)
B O(2^N) O(logN)
Thus you get to choose between time that is exponentially worse than
it might be, or space that is exponentially worse than it might be.
The following lets you get both (version C):
start generating code;
if (c1) generate instruction to call p1;
if (c2) generate instruction to call p2;
...
if (cN) generate instruction to call pN;
for (i = 0; i<L; i++) execute generated instructions;
This gives you the low time complexity of version B, with the low
space complexity of version A. It's machine dependent, however.
It _is_ possible to get the same form of time complexity without
machine dependent operations, as follows (version D):
void (*pa[N+1])();
j = 0;
if (c1) pa[j++] = p1;
if (c2) pa[j++] = p2;
...
if (cN) pa[j++] = pN;
pa[j] = 0;
for (i = 0; i<L; i++)
{ for (j = 0; pa[j]; j++) (*pa[j])();
}
This essentially works by generating code in a very specialized
language, and then interpreting that code. This gives the same
form of time complexity as version B, but with a larger constant
factor - perhaps considerably larger if the pi aren't really procedures,
but just short code sequences.
This is the general sort of reason for the bit-blit algorithms that
generate code on the fly, and I've encountered the same sort of
situation in other contexts (e.g. a "life" program). One could
imagine a compiler performing the transformation from version A
to the machine-dependent version C, but I've never heard of such.
Even if it did, it is a bit disturbing that a transformation of
such significance for space/time complexity isn't expressed at the
source level. This would be solved by having the compiler transform
version D to the faster version C, but that's probably even harder
for the compiler-writer to accomplish than the A to C transformation.
Radford Neal
Why are an Icache plus a Dcache better than just
a big shared cache as big as both?
--
Peter da Silva `-_-' Ferranti International Controls Corporation.
"Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
You can save an even greater factor by building the BitBlt into the
hardware. This is what the Commodore Amiga does, using a chip called
the Blitter. I'm pretty surprised that a chipset this good that's cheap
enough to be put in a consumer machine hasn't been picked up (or
emulated) by the big boys. It makes a significant difference to the
speed of the machine.
And you could cut that ugly self modifying code out of BitBlt. Instead
you just:
Wait for the blitter to be free (lock it).
Set up its parameters (it can take 3 source bitmaps and one
destination bitmap, with any minterm)
Blit (at DMA speeds, up to one word per clock).
You can even return to the caller before the blit's finished, if the
locking is in hardware or if the blitter can generate an interrupt to
clear the semaphore.
And no LANGUAGES that I'm aware of. But that's the whole point. CAN they?
[ separate I-D cache arguments omitted, thus excising the neat idea
of a "snoopy" cache ]
> Assume that you have an algorithm that is, say, 1000 instructions and
> that each instruction takes 4 cycles to execute (must be a RISC :-).
> One particular part of it, 10 instructions, is written as
> self-modifying code. It can also be written as 100 instructions of
> non-self-modifying (static) code.
I wouldn't use self-generated code in such a case, I'd do it the way you
suggest. But in a different case:
The algorithm has 1000 or so set-up instructions, which generate a small,
fast loop that is executed for each and every pixel displayed on a screen,
but which must do different things to each pixel depending on current
program conditions -- maybe 10 or 12 decisions that can be folded into
perhaps 4 or 5 instructions, but which would require 10 or 12 branch
instructions (plus the executed instructions) for every pass, learning
over and over (and over, and over) what the constraining program conditions
are. And how many pixels are there on a screen? Well, lots -- 1024 X 1024
is almost obsolete now :-).
The alternative, to have a separate (static) loop for each possibility,
will run you out of space very quickly if the program conditions that make
sense are combinatorial in nature -- the commonest case. And a few hundred
copies of *nearly* the same loop would not be easy to maintain.
I believe we can agree (can we?) that some useful conditions can arise where
self-generated code can be very useful. The term "self-modifying code"
is, I think, a misnomer, implying a loop which changes itself into something
else, which then ...
But if code generation is kept separate from code execution (which might well
be a reasonable condition to impose in any formal description), I doubt any
serious confusion would arise, and I can't see that much would be lost. Thus
the term "self-generating" code might be better, or perhaps even "programmed
code generation." That makes it sound less scary.
"Self modifying code" such as was used on early computers, for example
early CDC 6600 machines (not flaming them, just the example I know)
was a NECESSARY technique for performing subroutine linkage. There was no
instruction to perform a jump to an arbitrary address (i.e. the return
address of the subroutine) so the subroutine entry code *had* to overwrite
a jump instruction's operand (at a known, per subroutine, address) before
calling. This was usually the instruction immediately at or before the start
of the subroutine. In pseudo-code:
/* caller */                    /* callee */
ld  r0,#ret_addr                subr:   jmp anywhere   /* TBA */
st  r0,subr+1                           ...
jmp subr+2                              ...
ret_addr:                               jmp subr       /* return */
  ...
(Assume that it is word addressed and the jmp instruction takes one
word, as does its operand.)
THIS is self modifying code at its worst, gaaakk phut, urk, where is the
bathroom. Ever wondered why the early Fortran standards didn't even
*allow* recursion?
Even earlier machines used similar techniques for array indexing and other
irrelevant stuff.
"Run time generated code" is a whole different kettle of fish, and
seems to be what all of the proponents in the discussion have been referring
to. This is where issues such as separate I/D spaces/caches, operating
system support, and so on, come in.
BITBLT is, as noted, the classic example where run time generated code
can do the job much faster than any parameterised loops.
But load-and-go compilers are also a sensible thing in some environments.
The only difference between generating code in your own address space
and jumping to it, or generating it in a file and 'exec'ing it,
is that the operating system knows how to do the latter correctly.
Nobody really objects to optimising compilers, now do they? So
what is the problem with optimising code generators for some other class
of problem.
I agree with John Mashey that the most needed thing at the moment
is appropriate operating system or hardware support to address the
caching and accessibility issues.
The second most needed thing is to stop calling "run time generated code"
"self modifying code" so people will think before barfing.
Greg Rose, Softway Pty Ltd, Australia
...uunet!munnari!gr...@softway.oz
"Assume your favourite disclaimer since I don't think I need one"
From: h...@cernvax.UUCP (hjm)
Newsgroups: comp.lang.c,comp.arch
Summary: What is self-modifying code anyway?
Date: 13 Jul 88 09:44:20 GMT
Organization: CERN European Laboratory for Particle Physics, CH-1211 Geneva, Switzerland
>> As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>> an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>> but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
>> difference? Any thoughts on the subject?
>> Hubert Matthews
>> (I don't consider LISP or PROLOG programs that create code on the fly to be
>> SMC. Does anyone disagree?)
The main difference that I see is that the code in one case is not
reentrant. In systems like unix where it is possible for more than
one process to share a code segment at runtime, the jump table is
local to each process while the code is not. Imagine if vi had true
self-modifying code in it... (Many users typically share a code
segment for programs like vi and csh) The example in which you create
an array and then call it as a function does not really fit this
definition of self-modifying, since the code that is changed is
created in data space at runtime and hence cannot interfere with other
programs using the same code segment.
*Real* SMC would be something like:
main()
{
char func();
long *longptr;
longptr = (long *)func;
*(longptr + 20) = 0xffffffff; /* shudder!!!!! */
}
+-----------------------------------------------------------------------------+
| Jerry Jackson UUCP: seismo!esosun!jackson |
| Geophysics Division, MS/22 ARPA: esosun!jac...@seismo.css.gov |
| SAIC SOUND: (619)458-4924 |
| 10210 Campus Point Drive |
| San Diego, CA 92121 |
+-----------------------------------------------------------------------------+
> You can save an even greater factor by building the BitBlt into the
> hardware. This is what the Commodore Amiga does, using a chip called
> the Blitter. I'm pretty surprised that a chipset this good that's cheap
Isn't this in essence what CBM did with the "sprite"s on the Commodore
64? (Can I mention that machine in this group? I didn't think so. :-)
For those that never toyed with the 64, it has a graphics chip that
handles from 0-8 (8, I think) MOBs called "sprites" (stupid name).
You tell it what the MOB looks like, then just tell it to move the
MOB. It handles multi-color (out of 16), foreground/background, and
automatic collision detection. I was amazed to find that IBM PC's,
Apple II's, and the other early C64 competition didn't have this capability.
Anybody know what home/micro computer did it first? I'm pretty
sure the Vic-20 could do this, but I'm not sure about the PET.
--
Skate UNIX or go home, boogie boy...
[Obscure joke goes here]
J. Eric Townsend ->uunet!nuchat!flatline!erict smail:511Parker#2,Hstn,Tx,77007
..!bellcore!tness1!/
Of course, if the code isn't cached, perhaps the win is less
significant. If we want to toggle the don't cache bit (i.e. set
while modifying the code, clear after the code is written), it
may be difficult to claim a performance win for non-supervisor
processes. If it takes two system calls per SM operation, the
overhead is pretty high.
(If it is faster to flush the caches, that may be an option,
but there may be very little control over the granularity. How
would you implement a 'flush_address(base, bounds)' system call
on your favorite mmu/cache system?)
However..., if a user process can't get to the mmu the operating
system has a bit of a problem. I guess the trick is to write
protect program text and execute protect the data. Of course,
this causes an execute-protect fault on the modified code page(s).
The kernel could then flush the caches and write protect the
page. Again, this is a lot of overhead.
(Yes, I know some MMUs don't have an execute-protect mode. Workarounds
may exist, e.g. the function-code mode on the 030. Ugh.)
All in all, a person had better have a good reason for writing
self modifying code. In some cases the performance win from
the code may be lost in overhead. There are many other facets
to this problem, too.
Cheers,
cj*
=========================================================
This message does not reflect the views of
UniSoft Corp but this sentence does.
I believe that the first (or at least one of the earliest) was the old
ATARI 400/800, which had a system called player/missile graphics which had
quick movement independent of actually manipulating display memory and
collision detection. The chipset was designed by Jay Miner, who also designed
the Amiga's chipset.
Sprite (or PM or MOB) graphics are different from the blitting that is done
on the Amiga. The sprites are independent of the display memory, acting as
an overlay, while the blitter actually manipulates display memory (and any
other memory in its address space). While the sprites are faster, they are
limited in terms of colors and width of image, while the blitter is a more
general purpose device.
------------------------------------------------------------------
Lewis Bruck
lbr...@eneevax.umd.edu
Not officially representing anybody, anywhere, for any reason
This was supposed to be a reply to Doug Pardo, but the Brain-Damaged-Mailer
bounced the reply, so I'm posting...
What about the case mentioned by someone for speeding up SmallTalk? In this
case, you're compiling a small routine, instead of the whole program, but
have the same information.
HP used something like this for a fast APL compiler
long ago in a galaxy far away. They compiled code for adding to objects
according to the object types at compilation time, and the code included a
check for just those types. If the check failed, it bounced back to the
compiler, which generated code for a more general case; so, for example,
the first time it might compile code for arrays of size i=constant, and if
the size changed, then it would compile slightly more general code for arrays
of size n=variable, then if that also failed, maybe for arrays of rank m=var,
size n=var, etc. It ran much faster than interpreters running on faster
machines so it can be a big win. Again, in all cases, you know which addresses
to flush before you branch to them.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
But are we allowed to reminisce about intra-vertical-trace interrupts
and adjusting the sprites and colors on the fly to get more sprites
and colors than the hardware supported? (Talk about your realtime
applications! :-) )
(No, actually, I never did any of that.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris
This view is valid, but simplifies the issues. If you are going to
think of the program as a database, you must actually consider it as a
distributed database. This is because the data is replicated in
several places: the physical memory, the swapping device, the CPU
caches, and internal processor registers. The problems of maintaining
consistency in replicated databases are much less well understood than
concurrency control in non-distributed databases. Additionally, any
form of consistency maintenance must incur some performance penalty,
and hardware designers are under great pressure to make machines go as
fast as possible. Multiple caches that do not maintain consistency
are one answer to this. Since self-modifying code has generally been
considered bad style for quite some time, they felt comfortable
implementing a hardware optimization that assumes code will not modify
nearby locations.
Barry Margolin
Thinking Machines Corp.
bar...@think.com
{uunet,harvard}!think!barmar
As to the semantics of self-modifying code, isn't it identical to
a Turing machine's or am I missing something? Of course, reduction
machines, like Turner's combinatoric implementation of SASL, are
inherently self modifying.
MB Clausing
| In article <52...@june.cs.washington.edu>, pa...@june.cs.washington.edu
| (David Keppel) writes:
| >
| > Let me rephrase my position. There is nothing architecturally weird
| > about programs that generate their own code. Doing so does not cause
| > any problems on any machines that I am aware of, although few
| > OPERATING SYSTEMS support this.
| >
|
| And no LANGUAGES that I'm aware of. But that's the whole point. CAN they?
Pop11 and its predecessors Pop2, Pop10. In Pop11 the compiler is a bunch of
procedures that user programs can call to plant code which is then runnable
by those same user programs. But then that's what you expect in an incremental
programming environment.
In Poplog (the "home ground" of Pop11), the compiler procedures are also
accessible from the other languages of the system - Common Lisp, Prolog, ML.
The "other" languages are implemented using those same compiler routines.
The shared virtual machine language is translated to native code in most
Poplog implementations. "All" it takes to move the system is for the bottom
level to be ported - the rest of the compilers move for free.
So languages exist in which programs can generate "their own code". It's not
even peculiar to do so.
Regards,
Kers. | "If anything anyone lacks, they'll find it all ready in stacks".
PS. The recent "partial application in C" debate in comp.whatever was
interesting as partial application is another Pop11 primitive procedure .....
In terms of hit/miss ratios, a unified cache is clearly better. However, in
terms of effective access time, a split I/D cache is better. This is because
both can be accessed simultaneously, instead of one having to wait for the
other to finish. If both are getting accessed simultaneously (which should
happen 40% of the time, if Loads/Stores account for 40% of instructions), then
this more than offsets the increase in miss ratio.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
When the processor has separate instruction and data buses, so
concurrent accesses can occur to both caches.
--
-- Tim Olson
Advanced Micro Devices
(t...@delirun.amd.com)
(...version B deleted...)
>The following lets you get both (version C):
>
> start generating code;
> if (c1) generate instruction to call p1;
> if (c2) generate instruction to call p2;
> ...
> if (cN) generate instruction to call pN;
> for (i = 0; i<L; i++) execute generated instructions;
Interesting. This reminds me a lot of the kind of compile-time manipulation
that optimizing compilers do. The loop
for (i = 0; i<L; i++) execute generated instructions;
is an optimized version of A *when the ci are known*. If they were known at
compile-time an optimizing compiler could have generated the same code. The
only difference is that the self-modifying program does this *at runtime*,
when we are guaranteed to know the values of the ci's. Maybe we should call
this technique "run-time compilation"?
Bjorn Lisper
For a given real-estate, a non-snoopy I-cache will hold more data
bits and be faster than a snoopy I-cache.
+ For a given real-estate, the I-cache hit rate will be better. This
makes the average instruction fetch time lower. This may be a win
if instruction fetch rate is at all a bottleneck.
+ For a given cache size, the real-estate can be smaller. This means:
+ The cache may be cheaper.
+ The cache may be small-enough to put closer to the instruction
register, making it effectively faster.
+ The logic is simpler and more regular.
+ Faster design times.
+ Fewer bugs.
+ Easier to incorporate into VLSI designs (==faster).
+ Less logic to drive => faster, less power.
+ The cache doesn't need to be connected to as many things.
+ More freedom to place the cache => faster, cheaper.
+ Less external logic to drive => smaller, faster, cheaper.
+ Aliasing and data distribution are less of a problem (or none at all).
This lets you build (with much less pain)
+ Hierarchical cache organizations (faster).
+ Virtually-addressed caches (faster).
I-caches are particularly useful on machines that make heavy use
of memory. This includes:
+ Many high-performance processors.
+ Shared-memory multiprocessors.
The primary problem (that I'm aware of) with non-snoopy I-caches
is that you must manually flush the I-cache every time that an
instruction is changed. (Strictly speaking, every time you change an
I-space address that has been touched since the last cold start).
Does that answer your question ?-)
;-D on ( Faster ) Pardo
> And no LANGUAGES that I'm aware of. But that's the whole point. CAN they?
Do you count Forth as a language?
For that matter, I've generated and then executed code in a program
written in Forth running under UNIX. Real code, not threaded code:
CREATE UNDER ASSEMBLER
S )+ R0 MOV,
S )+ R1 MOV,
R0 S -) MOV,
R1 S -) MOV,
R0 S -) MOV,
NEXT,
Which can immediately be used...
I beg to differ. We ran a few simulations, comparing split and unified
caches (total cache size in the 32K-512K range), and when the caches were
direct-mapped (i.e. 1-way set-associative) the unified cache performed worse.
Our guess is that when code and data are forced to co-exist in the same
space, you get a high probability of collision and thrashing. This effect
went away when we increased the degree of set associativity. One way to
think about it is that having two (direct-mapped) caches gives you some of
the benefits of set-associativity. Take this with a grain of NaCl: we only
tried a few test programs, and each was small enough to fit into the cache.
-Joe
Wouldn't the following be "nearly" as efficient, and be portable to boot?
At least it has the same asymptotic complexity.
enum { END, P1, P2, ..., Pn } code[MAXCODE], *p;
p = code;
if (c1) *p++ = P1;
if (c2) *p++ = P2;
...
if (cn) *p++ = Pn;
*p = END;
for ( pixel = 0; pixel < NPIXELS; pixel++ )
for ( p = code; *p != END; )
switch (*p++)
{
case P1:
[p1 instructions]
break;
case P2:
[p2 instructions]
break;
...
case Pn:
[pn instructions]
break;
default:
assert(0);
/*NOTREACHED*/
}
--
|| Tom Stockfisch, UCSD Chemistry t...@chem.ucsd.edu
There is a quite well-known (and quite controversial) paper on the
subject of programs that can modify themselves and their interpreters,
by Brian Smith:
Reflection and Semantics in a procedural language
MIT-LCS-TR-272 Mass. Inst. Tech.
January 1982
Reflection and semantics in Lisp
11th ACM symposium on principles of programming languages
Then there is the reply to these papers by
Mitchell Wand and Daniel Friedman
The mystery of the tower revealed:
a non-reflective description of the reflective tower
CACM 1986
(this is as far as I can go since I have a copy of a copy and there
is no publishing information on it)
An extended version of this paper was published in
Meta-level architectures and reflection,
P. Maes and D. Nardi (editors)
Elsevier Science Publishers B.V. (North-Holland) 1988
There is a lot of theory about self modifying systems and
self referential systems but by the time you start looking into it
you are in philosophical logic, not comp.architecture
Se\'an Matthews
arpa: sean%uk.ac.e...@nss.cs.ucl.ac.uk
> Why are an Icache plus a Dcache better than just
> a big shared cache as big as both?
1) (Major) with split caches, you can access a word of instruction and
a word of data in the same cycle. With a joint cache, you get one or
the other. As a truly gross estimate, expect to lose 10-40% of your
performance, given conflicts between I-refs and loads/stores.
2) Less important, but still useful. As you make a direct-mapped cache
bigger, each successive 2X improves the hit-rate, but it improves it less than
the last 2X. At some point, it is hard to make it bigger and keep the same
speed SRAMs, and then the only way (keeping the same organization) to make
the hit rate higher is to make it set-associative. Given the same total
cache size, hit rates for many programs (not all, there are exceptions):
joint, direct-mapped <=
split, direct-mapped <=
joint, 2-way set associative
Note: those are Hit Rates, not performance. There are a bunch of reasons
to keep caches (at least the ones nearest the CPU) direct-mapped when
you can.
A fine paper that analyzes the various tradeoffs in cache design
[rather than just hit rates] is:
S. Przybylski, M. Horowitz, J. Hennessy, "Performance Tradeoffs
in Cache Design", Proc. 15th Annual Symposium on Computer
Architecture, May 30 - June 2, 1988, Honolulu, Hawaii.
(Computer Architecture News 16, 2 (May 1988), 290-298.)
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR ma...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Another relevant point about separate caches is this:
instructions and data have different access patterns. A cache
designed for the one is not necessarily going to be right for the
other; a cache designed for both may not do as well as the
separate ones. So this possibility has to be balanced against
the general benefit of having a larger pool to work with.
>For example, if treating code as data is the definition, then does passing a
>procedure as a parameter in PASCAL, or a pointer to a function in C count?
>Probably not.
>OK, what about a jump table. Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC? Again, I don't think so.
>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC. Does one level of indirection really make that much difference?
A simple definition might be any program that cannot be compiled and run
in shared text mode (or equivalent for non Unix application environments).
Modifying a jump table in your data space does not affect how the program
will run for other users.
Modifying a jump instruction in the shared text *will* affect how the
program will run for other users.
I have used SMC in places that were by design not required to be shared and
where a high degree of efficiency was required. For example in a p-System
type interpreter you need to have a test in the opcode fetch and dispatch
loop for a pseudo interrupt. This takes a lot of time. By simply patching in
a jmp to_int as the first instruction of the loop we eliminate the need for
an explicit test.
To make the above example work for multi-user systems, we used a jump
indirect through a var which contained either the ifetch address or
interrupt handler address. A little slower but still faster than explicit
test in the ifetch loop.
Similar types of problems can arise when doing high performance device
drivers.
--
Stuart...@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl Vancouver,BC,604-937-7532
True. A pointer is data.
>OK, what about a jump table. Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC? Again, I don't think so.
Again true. Note that it may be possible that the system (O.S.,
processor) checks whether the new pointer value represents a valid
address, or a valid entry point. Whether this is (always) desirable is an
entirely different question. E.g. some architectures will protest if
you try to jump to an odd address. PASCAL will probably leave you less
room to cheat than C.
>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC. Does one level of indirection really make that much difference?
Yes, I think it does make a difference. Maybe not always, but there are
cases that you can't get away with this: think of re-entrant code, or
shared text segments. Now I'm thinking of recursion, what about putting
the code on the stack 8-) ? No worry about re-entrancy, and the C
program model becomes more complete (we have already global, static and
automatic data, and global and static functions, and now there's
automatic functions...).
>Of course, if you want to be really horrid in C, you can try something like
>this:
>
>char codearray[] = { 0x03, /* one byte instruction for something */
> 0xc9 /* one byte return instruction */
> }
The author must be (or once was) a Z80 freak! At least I recognize the 0xc9 as:
RET. On a Z80 you could do other horrible things too, since
instruction sizes vary from 1-4 bytes; by carefully picking your
instructions you could use the same code twice (using a one-off entry
point), with entirely different result. Good (?) old days, when memory
was more expensive than programming effort....
>and then 'call' this function using a cast to turn the pointer codearray into
>a pointer to a function. (Who needs #asm anyway!) Then you can modify the
>code as much as you want. This _is_ SMC without a doubt, because you can
>overwrite code. So, I propose a very weak definition for SMC as code that
>writes over other code.
Note that not all implementations will allow initialized arrays to be
altered. I remember a problem we had last year on VMS while passing an
initialized char array to mktemp() (or whatever its name is); the
program stackdumped because mktemp tried to write into the readonly
array (yes, VMS put it in readonly!).
>As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
>difference? Any thoughts on the subject?
Try putting your code in ROM and see what happens. Just one example.
Besides I think jump tables are generally not altered, pointers are.
The jump tables could well be in the text segment, for instance.
A jump table is not an array of function pointers.
> Hubert Matthews
>
>(I don't consider LISP or PROLOG programs that create code on the fly to be
>SMC. Does anyone disagree?)
No, unless the code is buggy; such code on a fly could well be
Self Multiplying 8-).
Leo.
Exploiting shared library designs might make this more feasible in the
future. It occurs to me that it might be possible to dynamically build
a file object that the OS will consider a shared library text and then
page it back into the text section. I would imagine some changes might
be needed to current designs (eg. do the shareable objects and symbols
have to be fully specified at initial load-time?)
Speaking of lisp and self-modifying code I worked on an IBM370 lisp
which accomplished trace and stepping modes (debugging) by
self-modifying the top of the eval loop. Of course it didn't save much
more than a test and branch, but eval loops are entered for every
operator/function (eg add, subtract etc etc etc). On a 370/158 with
less than a MIP and 100 logged-in users one becomes a fanatic about
such things; I don't know how much it helped in fact.
And there was always the IBM370 MVC (move characters) instruction
which only took a fixed number of chars specified at assemble time, a
common sequence was (particular register numbers irrelevant):
stc r1,*+5
mvc 0(0,r2),0(r3)
which overwrote the zero before the comma (0,), the length. The 370 of course
provided an EX (execute) instruction which allowed indirect execution
of an instruction after OR'ing in a particular register's low byte,
naturally that was designed to accomplish the above (SS instructions
which took a fixed length field all had the same basic format using
either the byte or two nibbles as a length indicator.) Why people
tended to use the store char hack above rather than EX was always a
mystery to me, other than tribal practice.
-Barry Shein, Boston University
No. The state mechanism of a Turing machine is fixed and finite.
To show equivalence, it is necessary to show that self modifying code
does not increase the number of states. Usually, people just appeal
to Church-Turing and assume all the states are there, somewhere.
For something like a load and go compiler, since every possible program
is not hidden somewhere in the compiler, it would appear to escape CT.
However the compiled program can be regarded as a transformation of the
input tape into another region of the tape and its apparent execution
as an interpretation of the tape by the compiler/loader/cpu state machine.
Then again, if you believe really hard and clap your hands.....
sm ryan
----------------------------------
Hugh: Why does he carry a sword?
Fred: He has delusions of grandeur.
him: Uh, Fred...
(SWOOP)
Fred: YEEK
him: ...before you make a value judgement about a person's delusions...
(poink)
Fred: AWK! My head feathers!
him: ...you'd better be absolutely sure they ARE delusions.
(JAB)
Fred: OUCH!
-- Odd Bodkins
Actually, the discussion will never end because it changes as it goes along.
Oh, and regarding the chap who asked for a definition of SMC: SMC is code that,
when you look at it, you go "Blecch-- this is self-modifying code."
Many Lisp (Prolog, Smalltalk, ...) systems generate a pseudocode
rather than native instructions. The p-code is run by an
interpreter rather than the hardware. Thus, as far as the hardware is
concerned, it is only the interpreter and NOT the p-code that is being
executed.
A number of more sophisticated Lisp systems DO generate native code
for themselves (as opposed to Lisp systems where you compile a
program to native code and once you start it, you can't load any more
compiled code). These systems, however, are typically targeted for
particular machines (e.g., InterLisp for Xerox Dandelions) and thus it
is "safe" to make assumptions about the target hardware.
;-D on ( Me and my big segment ) Pardo
NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays, as well) had a Jump
to Subroutine instruction: RJ <addr> (Return Jump) (on the 6600); what this
would do is write a <JP {calling address+1},NOP,NOP> into <addr> (the
machine could pack up to 4 instructions in a word [60 bit words, 15 or 30
bit instructions], so the assembler had to look for labels for RJ
instructions and pad out the rest of the word with NOPs 8-)). (The machine
did this because it lacked one simple little feature found on many of
today's machines: a stack.) Therefore, when you did a RJ, it would perform
the indicated write, and start execution at <addr>+1, and, to return, you
would JP (or do an unconditional branch, which wouldn't clear out the
instruction cache) to the entry point, and boom! you're back where you want
to be. Because of other features of the machine (such as banks of memory
[i.e., the memory was accessed as one of 8(?) banks of RAM, with each
succeeding address in a different bank. Accessing any individual address could
be, oh, 8 times slower than the machine cycle, but accessing the next
address wouldn't be. Therefore, when you did an RJ, it could write to
<addr>, and then get <addr+1> at the next cycle], and the fact that you
could write to a location and execute another instruction [instruction
overlap]), this was *much* faster than having to maintain a stack would be,
and it also didn't tie up any of the registers (it only had 24 of them, one
of which was hardwired to 0, one of which was usually 1, and 7 of which, if
accessed, would either screw up memory or another register).
They were very fast machines, as I've said here before. If only they didn't
have NOS...
>Greg Rose, Softway Pty Ltd, Australia
>...uunet!munnari!gr...@softway.oz
Sean.
---------------
Sean Eric Fagan
seanf@sco
(408) 458-1422, ext. 3561
Uh, I don't know what systems you're talking about, but for the small
microcontrollers that I've used -- the ROM is assembled to be placed
in the memory map at an absolute location. That is, it takes up the
same amount of space whether it's in ROM or RAM (in fact, one of my
favorite hardware designs is a little SRAM card that I use to spoof
ROM for prototyping purposes). As for re-using memory buffers etc.,
those are always in RAM, so those are always re-usable -- no matter
where your executable is loaded.
32K bytes of ROM costs about 1/4th of what 32K bytes of static RAM
costs (not to mention the eliminating-need-to-load). Note that these
are dedicated control systems using 6502's, Z-80's, and 6809's, which
have a total address space of 64K -- so 32K of ROM is a heckuva lot.
> Seriously, newer architectures have reduced the difference (to 0 perhaps),
> but the emphasis on RISC these days may resurrect self-modifying code --
> a RISC-er technique is not known to mankind! (:-D}===<<).
Actually, RISC may be the stake-in-the-heart of self-modifying code.
RISC technology overcomes the increased memory bandwidth requirement
by using enhanced cache technology, 256-byte-at-a-time prefetch from
page-mode DRAM's, heavy pipelining, and every other trick in the book.
Most of which are quite contrary to the thought of self-modifying
code. Another RISC trick is to use the saved silicon for extra
registers, and have the compilers keep as many values as possible in
the CPU registers. You can always do a one-word memory-indirect load
off of a register much faster than you can do a "memory-write(the
address into the instruction) memory-read(the address part of the
modified instruction) memory-read" triplet (which is at least three
two-word instructions, as vs. a single one-word instruction, on a
machine where one cycle = one word processed).
But, I HAVE done self-modifying code on a 6502-based microcontroller
(out of sheer necessity). I still make an occasional enhancement to
that software, over 3 years later, as new applications arise, and have
had no problems with readability. After all, in assembly on a
microcontroller, you have to save the address somewhere, and the
fact that the "somewhere" happens to be inside an assembly language
instruction makes little difference. The argument against
self-modifying code lies elsewhere besides readability.
--
Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191 Lafayette, LA 70509
MISFORTUNE, n. The kind of fortune that never misses.
........
> NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays, as well) had a Jump
> to Subroutine instruction: RJ <addr> (Return Jump) (on the 6600); what this
> would do is write a <JP {calling address+1},NOP,NOP> into <addr>
........
The RJ would set up the return address, but how about the address in the call?
It is not that unusual to have subroutines or functions as arguments of called
subroutines, or computed by the program in some other way.
I have used a program which let the user decide which way something should
be done in a small subprogram. The main program had many calls to a dummy
procedure in that module, and the dummy changed each call to the desired
routine. This way the dozens of call sites in the main program would compile
as calls to the dummy, and the first execution would patch the call (and the
return at that time).
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)
>No. The state mechanism of a Turing machine is fixed and finite.
>To show equivalence, it is necessary to show that self modifying code
>does not increase the number of states. Usually, people just appeal
>to Church-Turing and assume all the states are there, somewhere.
This needs some qualification. Strictly speaking no computers
are Turing machines because they do not have an infinite memory or
equivalent. However almost all computers would be universal Turing
machines if they somehow did.
However it depends on what level you are speaking of. If you
are talking about the hardware, that's a *qualified* Turing machine.
You can talk about a language being a virtual Turing machine; that makes
sense, albeit you want to be more careful with your language than I am
being here. The language remains fixed, and most languages are universal
Turing machines. In this case the SMC is data. If you want to talk
about a specific program as a virtual Turing machine (probably not
universal) then, yes, it can't modify itself.
--
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
Richard Harter, SMDS Inc.
From: pa...@june.cs.washington.edu (David Keppel)
>In article <23...@bu-cs.BU.EDU> b...@bu-cs.BU.EDU (Barry Shein) writes:
>>Did I miss something critical here? Most lisp systems generate and
>>execute their own machine code.
>
>Many Lisp (Prolog, Smalltalk, ...) systems generate a pseudocode
>rather than native instructions. The p-code is run by an
>interpreter rather than the hardware. Thus, as far as the hardware is
>concerned, it is only the interpreter and NOT the p-code that is being
>executed.
Prolog and Smalltalk are not Lisp; the similarities are abstract, and
most implementations of the former tend towards the experimental,
while many mature Lisp production environments exist (note: this is
not a criticism, I wouldn't be shocked to hear one or two prologs
exist that have a true compiler, I'm just saying it's pretty much
irrelevant to my point.)
Name three Lisps that people on this list are likely to have heard of
which generate a "p-code" and not some form of machine language from
their compiler. I can't think of one (unless you're referring to T or
Scheme which are more or less lisp, a lot closer than Prolog or
Smalltalk anyhow.) I don't think I'd count Gnu Emacs' lisp, but
consider it mentioned. It wasn't intended as a general purpose lisp
although it's almost certainly the most used production lisp today.
>A number of more sophisticated Lisp systems DO generate native code
>for themselves (as opposed to Lisp systems where you compile a
>program to native code and once you start it, you can't load any more
>compiled code). These systems, however, are typically targeted for
>particular machines (e.g., InterLisp for Xerox Dandelions) and thus it
>is "safe" to make assumptions about the target hardware.
Interlisp originally ran on PDP-10's, although I'd admit that the 1100
series machines are microcoded to look a lot like a PDP-10. Of course
all lisp-machine lisps tend towards one architecture, that was sort of
the whole point. Symbolics, LMI, Xerox and TI.
Franz Lisp generates direct machine code for several machines (I know
of Vax, 68K and NS32K compilers, I have one of each of those here.) I
believe there was a 3B2 version and probably several others (there
must be a celerity version for running Macsyma.)
KCL (Kyoto common lisp) generates C code which is compiled by the
system's native C compiler. Most of the port work (if any) is in
accommodating differences in object formats and dynamic loading (and a
very few machine coded routines.) I'd call that direct machine code,
even if another program is used as a backend (cc) to finish it up.
Lucid and Allegro Common Lisp both generate machine code as far as I
know for several architectures, at least Vax, 68K, Celerity
(proprietary) and certainly several others.
There are some PC lisps I don't know much about.
Maclisp was definitely locked into the PDP-10 and is ancient history
now.
I don't know about Spice Lisp, not sure if it ever ventured off the
10, ancient history anyhow as far as I've heard. Hedrick used it to
produce a very nice Common Lisp for the '20 which was definitely
locked into that architecture.
Betz's xlisp, last I checked, didn't have a compiler.
PSL (Portable Standard Lisp from U. Utah) generates direct machine
code from an abstract LAP code back-end translator and works on
several architectures (at least Vax, 68K and 370.)
NIL generates machine code but is completely locked into the VAX
(mostly a Macsyma delivery vehicle I believe.)
Have I missed any major lisps? Probably a few, but they really are
enumerable if one sticks to some criteria of popularity, no need to
develop sweeping generalizations (by either of us.)
-Barry Shein, Boston University
Mike Coffin mi...@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci. {allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ 85721 (602)621-4252
Technically,
+ EQ caller_address+1
- VFD 30/0
>bit instructions], so the assembler had to look for labels for RJ
Compass forces upper on all labels.
>instructions and pad out the rest of the word with NOPs 8-)).
And SB0 B0+46000.
>did this because it lacked one simple little feature found on many of
>today's machines: a stack.)
Actually, hardware support for stacks. I implemented reentrant and recursive
stack frames for doing multitasking in Compass.
>the indicated write, and start execution at <addr>+1, and, to return, you
>would JP (or do an unconditional branch, which wouldn't clear out the
>instruction cache) to the entry point, and boom! you're back where you want
And sneaky code puts exit code above the entry point so that it falls into
jump back to the caller:
RTN0 exit processing
RTN PS entry point.
junk
ZR ...,RTN0
more junk
>[i.e., the memory was accessed as one of 8(?) banks of RAM, with each
8, Aye.
>succeeding address in a different bank. Accessing any individual address could
>be, oh, 8 times slower than the machine cycle, but accessing the next
Actually, I think a major cycle was 10 minor cycles.
>They were very fast machines, as I've said here before. If only they didn't
>have NOS...
Meaning NOS 1 I hope.
NOS 2 has a lot of nifty features I wish Unix had.
The best data I have seen says that a big shared cache as big as both
would be better - all other things being equal.
The other things include stuff like cycle time, simultaneous access,
line sizes.
For example, with split I/D caches you can have simultaneous access
to both the I and D cache in the same cycle; with a shared cache this
is harder to do. The I cache doesn't need to have expensive bus
watching circuitry for invalidations. The I and D caches can be
located physically closer to where they will be used. The smaller
the cache, the faster. An I cache can be optimized for instruction
access, which is much more sequential than data access (this is the
same type of argument that can lead to a separate cache, or no cache,
for vectors).
Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
ag...@gould.com - preferred, if you have MX records
ag...@xenurus.gould.com - if you don't
...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
Ok, how's this for a sweeping generalization: I'm dead wrong.
Sorry, guys, but I do sometimes clutter the net needlessly.
I apologize for my disinformation.
;-D on ( Oooh, did I just break netiquette, apologizing ?-) Pardo
In other words: a self-modifying discussion! :-)
Bjorn Lisper
> ;-D on ( Oooh, did I just break netiquette, apologizing ?-) Pardo
Don't worry. The practice is very unlikely to catch on. We can
just consider this to be an isolated incident.
-- Dave J.
I can see several reasons.
* the big, big reason for referring to code via pointers, and getting the effect
of self-modifying code via such pointers, is that you make your changes
independent of the size of the code. Real SMC will only work when the new
code is no larger than the old code. I think this is a very restrictive
assumption.
* when you alter a jump table (in C, at least) you are doing so within the
language, and can expect the compiler to understand you. A language which
allows you to modify instructions directly would of necessity depend strongly
on the machine architecture to run these instructions. Otherwise, why don't
we all use Universal Assembly Language?
Roland Conybeare
cony...@moncsbruce.oz
an instruction, like mov (pc)+,#1234 you are assuming that the change you make
[a long, and generally well-reasoned debate elided]
>>Sorry, no. I've heard LOTS of arguments against programs that generate their
>>own code,
Gentles, could we ***PLEASE*** reserve the term "self-modifying code" for
code which actually modifies itself on the fly (eg, for generating
indexes by instruction modification instead of using index registers)?
Much of what is being discussed here is part of the well-known and
respectable "generate and execute" paradigm. The only difference
between this and the normal code-generation paradigm is that an
application generates code at run-time, not a compiler at a previous
time.
Mixing the two is making this discussion "noisy".
--dave (sorry about the pedantry, but it's important) c-b
--
David Collier-Brown. {mnetor yunexus utgpu}!geac!daveb
Geac Computers Ltd., | Computer science loses its
350 Steelcase Road, | memory, if not its mind,
Markham, Ontario. | every six months.
The first home system I know of with "sprites" was the Atari
2600 game machine. They called them "players". They were very limited
by the hardware; 128 bytes of RAM, 4K ROM, no DMA so the CPU had to do
everything in real time. The first computers that had this capability
were the Atari 400 and 800, the machines CBM targeted with the C64.
They had 4 "players" and 4 "missiles" (2-pixel-wide sprites), and all of
them were DMA driven. The chipset used in the Atari 400 and 800 was
designed by Jay Miner, the same person who designed the chipset used
in the Amiga.
------------------------------------------------------------------------------
Dan Moore
AT&T Denver
ihnp4!druhi!dlm
------------------------------------------------------------------------------
What about the PolyForth scheduler. Each entry in the process table contained
a jump instruction. The next instruction was the save space for the current
PC in a context switch. So a context switch involved jumping to the next
process's process table entry. When a process waited, it put the address of
the next process table entry in that location. This meant that the process
was going to effectively busy-wait, but it's a *very* efficient busy-wait.
For the sort of tight real-time control that PolyForth was designed for, this
was apparently the most efficient way of doing things.
--
Peter da Silva `-_-' Ferranti International Controls Corporation.
"Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
(1) There are a couple of commercial Prolog systems which generate native
code. In the interests of higher performance for large programs on
small (4-8Mbyte) machines, Quintus Prolog is _not_ one of them, but
that is another (and perhaps more interesting) story.
(2) All the commercial Prolog systems that I know of offer an interface
to some other language (Lisp, C, or Pop). So even the systems which
use pseudocode rather than native code aren't off the code-modifying
hook: if you want to be able to load C code at run time you still
have to make additional parts of your address space executable.
(But once a region of the address space has turned into code, it need
never be changed again.)
(3) VMS, SunOS 4.0, and I believe Apollo's AEGIS support dynamic extension
of programs by mapping *shareable* library code into a program's
address space (System V.3 on a 386 sort of kludges this in too).
Would anyone care to explain what architectural features are needed for
efficient dynamically mapped shared code? Is position independent code
essential, or only better? What popular architectures couldn't do it?
If SPARC can do it, is there anything about RISCs which makes it easier/
harder?
No. You can save a large factor over *poor* software implementations of
BitBlt. Not over good ones. The dynamic-compilation implementations on
the Blit et al run the memory at nearly full speed for the usual cases
of BitBlt; it is fundamentally impossible to do better. Commodore et al
find hardware implementations attractive because they don't know how to
write good software.
--
Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology
a traitor to mankind. --Pournelle |uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
For those of you interested in on-the-fly generated code, here is a reference
to an interesting article:
Turning Interpreters into Compilers
Frank Pagan - C.S.U. Fullerton
Software - Practice and Experience, June 1988
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
Back to the subroutine call in the CPU, which worked by planting a return
instruction at the target, then jumping one word beyond. Herman Rubin
asked: "And how would you call a function which is an argument?"...
> The RJ would set up the return address, but how about the address in the call?
> It is not that unusual to have subroutines or functions as arguments of called
> subroutines, or computed by the program in some other way.
Of course--even in FORTRAN this is common, and the 6600 was a FORTRAN
machine if ever there was one. The answer is simple: You emulate the RJ
instruction. You form the return instruction (actually this can be done at
compile time to save execution) and plop it into memory just as the RJ
would have done; then you jump to the target routine. The jump address is
formed from a register plus a constant, so it can be a variable or computed
address with no problem. This approach roughly doubles the time to call a
routine; the return is unaffected.
[It is one of life's enduring mysteries why the "RJ" (subroutine call)
instruction allowed only a constant operand, while the "JP" (unconditional
branch) allowed register plus constant! There was room in the RJ
instruction format for a register.]
It's interesting to look at what it costs to do subroutine calls with
return addresses on a stack (instead of plunked into memory) so that you
can have recursive calls. It's only about 1.5* the time for the native
call/return to constant address, and it's about a break-even for call/
return to variable address. (It's slightly faster on the 6400, slower on
the 6600.)
--
Dick Dunn UUCP: {ncar,nbires}!ico!rcd (303)449-2870
...Are you making this up as you go along?
One time last summer I changed my password right before the weekend, and
wouldn't you know it, come Monday I had no clue as to what the password was.
I tried all kinds of disassembly and debug tracing to no avail. Hence, another
reason for self-modifying code.
--harlan
Hubert Matthews
I disagree. It is not necessarily true that good software blit code can
run as fast as hardware supported blit operations. Sometimes, maybe, but
not always. Henry's claim "it is fundamentally impossible to do better"
doesn't fly. What about a CPU with a 16 bit bus competing with a blit
chip with 64 bit access to the frame buffer? What about a CPU that must
fetch instructions on the same bus used for data?
Henry's statement is over-broad.
-- David Schachter
#include "disclaimer.legal"
#include "funny.quote"
But it's an ill wind, etc... The fact that self-modifying code behaved
differently (but predictably so) on different processor models was used to
create a code sequence which could identify the processor at run time and
let code adapt itself accordingly.
Issues like ROM and reentrance weren't the only headaches. How to write a
compiler?
I have in my possession the assembler source for an HP2100 Algol 60 compiler.
Now, that's some real macho hacking. Especially with no subtract
instruction, and about 2 registers ..
--
Don lin...@k.gp.cs.cmu.edu CMU Computer Science
"Imitation is not the sincerest form of flattery. Payments are."
- a British artist who died penniless before copyright law.
However, a hardware BitBlt using DMA can be operating while the CPU is
servicing other processes, with no slowdown in CPU speed. So even if
you can squeeze your BitBlt routine down to the bare minimum and it
takes x CPU cycles to do the blit, that is x cycles that are wasted.
Granted, if you are using a single-tasking machine, you won't get
much speedup using hardware blits. Since the Amiga is multi-tasking,
its blitter chip does provide speedup. Also, the Amiga blitter can do
logical operations (on three inputs) just as fast as a normal memory
copy - It'd be pretty hard to hack some 680x0 code to XOR three inputs
and write to three outputs as fast as it can move memory. Also, the
overhead is very low compared to the overhead involved in dynamic compilation.
If dynamic compilation is that much faster, then you would have expected
someone to have written a program that uses this technique on the Amiga,
rather than using the blitter. So far, I haven't seen any. If one shows up
that is comparable in speed to the hardware, I would bet that the overhead
and memory involved are high enough that the program's usefulness is diminished.
--
Darin Johnson (...pyramid.arpa!leadsv!laic!darin)
(...ucbvax!sun!sunncal!leadsv!laic!darin)
"All aboard the DOOMED express!"
I had the "pleasure" to maintain such a system during my previous existence
as an engineer, in the early eighties. Our system had a whopping 96K 16-bit
words of primary memory and two disk drives of 5 Mwords and 20 Mwords,
respectively. It was purchased sometime in the late seventies for a sum I
would guess was well above $20,000. (The system was used!) The software for
this machine was, well, medieval. Some of the (mis)features of its operating
system, RT-IVB, were:
* It did have time-sharing, but the memory was statically partitioned and
the only way to change the partitioning was to generate a new copy of the
operating system!
* Max address space for a program was 64K. No virtual memory, so if you needed
more, then explicit segmentation with code overlaying was the only way to
go.
* Dumb terminal drivers that didn't recognize ^S & ^Q, not to mention the
nifty block transfer protocols their own 264x series terminals were
capable of (which, amongst other things, made these terminals' function
keys useless).
* A contiguous-storage oriented file system that frequently led to
disk space fragmentation problems. Whenever that happened you had to
switch to single-user mode and run a disk compaction program...
* Much, much more, but I have occupied enough bandwidth with this already. I
do have to mention, though, the listings of reported bugs that came every
three months or so. A not too unusual comment after a bug report was: "this
bug will not be fixed"! I guess HP didn't put in too much manpower into
supporting this system. I also guess that they were right in not doing so...
When I write this, using GNU Emacs under X Windows on a $4000 Sun 3/50, I
realize that things have gone the right way since my engineering days.
Bjorn Lisper
Hubert, I was going to let this whole SMC discussion slide on by, but
your post finally convinced me otherwise.
First, you meant "y" := y + 2, not "x", right?
Second, it seems like only yesterday when we (the royal we) CPU
architects were so concerned with trying to narrow the semantic gap
between what a programmer was trying to express and what the
machine/language was coercing her into. Languages like Ada and
machine architectures like capability machines were intended to
address this perceived need. The basic idea is that (oversimplifying
drastically) in a small programming environment (you at your
standalone workstation) anything goes in terms of protection and
language type checking, etc.
The interesting domain is that of programming-in-the-large, with
large teams of programmers producing hundreds of thousands of lines
of code (SDI/Shuttle). There, extracting the last nano-mip of
performance is of far less interest than in avoiding bugs and
producing maintainable code, and I still believe that supporting that
environment is a far more important challenge to architects than
worrying about SMC could ever be. And I don't think (heavy sarcasm
here) that UNIX/C is the ultimate answer to that environment.
Bob Colwell mfci!col...@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405 203-488-6090
The Manchester University ATLAS, on which I cut my teeth (violins),
included in its standard (ABL) Assembler a most extraordinary example of
SMC. Each 48-bit instruction consisted of a 10-bit opcode, two 7-bit
register numbers and a 24-bit address. If the MSB of the opcode was 1,
the operation was an extracode, essentially one of 512 hard-wired
subroutines [trig fns, complex arith, double arith, tape motions, I/O, etc,
and how 512 such subroutines were squeezed into 4096 words, shared with a
compiler, supervisor, test routines etc is another gory story]. If it was
0, then the next bit was ignored, and the following two bits determined the
type of operation (00 illegal, 01 integer, 10 test, 11 floating).
ABL included 128 user-settable flags. Where were these stored?
Where else but in the ignored bits of 128 consecutive instructions within
ABL itself which happened not to be extracodes. I remember being amazed
by the discovery. I'm still not sure whether one should be appalled, or
whether one should marvel at the ingenuity. All those people who were
recently distrustful of an optimising assembler would have been really
dubious about a self-modifying assembler!
I once, and only once, wrote SMC myself. Also on ATLAS. It was a
self-initialising random number generator. First time through, it
initialised itself, either from the clock or from the user-supplied seed,
then pulled the real code in on top of itself. Must have saved 3us per call,
may well have amounted to 0.1% of the total run-time in heavy simulations.
I was proud of it at the time.
ATLAS would be a fine candidate for Bob Webber's collection of
historic computers for which there ought to be simulations. I still have
a cupboard full of machine-code listings of most of the important software.
<Reminiscence mode off>
--
Andy Walker, Maths Dept., Nott'm Univ., UK.
a...@maths.nott.ac.uk
The HP2100 became the HP1000 probably at least 10 years ago. The E- and F-
series 1000's retained model numbers starting with 21, such as 21MX.
I worked last Summer in HP1000 software support and got to see how the
support works. I also took a class on Microprogramming at CSU Chico this
past Spring in which we used an HP1000E 21MX (of 1978 vintage).
Anyway, the HP2100 software still runs on the 1000 as-is. But the 1000 was
redesigned into the A-series in the mid 80's in order to maintain a lead
in the real-time systems market.
I'm not trying to be defensive of a system that, at the time, earned the
comments Bjorn gave. It just sounded like he wanted to see improvement.
Below are some places where what he wanted has been done.
lisper...@CS.YALE.EDU (Bjorn Lisper) writes:
> * It did have time-sharing, but the memory was statically partitioned and
> the only way to change the partitioning was to generate a new copy of the
> operating system!
>
> * Max address space for a program 64K. No virtual memory, so if you needed
> more then explicit segmentation with code overloading was the only way to
> go.
RTE-A (Real Time Executive, A-series) has multitasking, multiuser, and virtual
memory. Yes, things have changed.
> * Dumb terminal drivers that didn't recognize ^S & ^Q, not to mention the
> nifty block transfer protocols their own 264x series terminals were
> capable of (which, amongst other things, made these terminals' function
> keys useless).
My project last Summer was to work on additional docs for the multiplexer
(8-port serial board). The dumb-terminal drivers are only included now
for backward compatibility. The newer drivers have enough features
that my "supplementary" documentation grew to 70 pages (including
illustrations).
> * A contiguous-storage oriented file system that frequently lead to
> disk space fragmentation problems. Whenever that happened you had to
> switch to single-user mode and run a disk compaction program...
That sounds pretty familiar. I remember the disk compaction program but
I didn't have fragmentation problems. I can't say for sure what, if anything,
has changed in this area.
They've made it look more like Unix in this respect. It has a hierarchical
file system with a Unix-style shell. (If you're used to Unix, though, don't
expect it to be identical. It's nice compared to the old FMGR but Unix gurus
would be disappointed.)
> [...] A not too unusual comment after a bug report was: "this
> bug will not be fixed"! I guess HP didn't put in too much manpower into
> supporting this system. I also guess that they were right in not doing so...
I can think of a few circumstances where they might not want to fix something
that will take forever to repair when a workaround is available. The main
point here is that those people do care (I know, I worked with them) - they
just have finite resources, like the rest of us. However, they do have
adequate manpower for most problems.
------------------------------------------------------------------
Ian Kluft RAS Lab
UUCP: hplabs!hprasor!kluft HP Systems Technology Division
ARPA: kl...@hpda.hp.com Cupertino, CA
------------------------------------------------------------------
Stupidity?!?? They didn't have stacks in those days. It probably seemed
like a pretty cute idea at the time.
I had the pleasure of spending many, many hours hacking on one of these
machines during my undergrad days. At the Digital Systems Lab at MIT a
group led by grad student extraordinaire Brad Hampson and fellow hacker
Don Slutz wrote a Multics clone in HP2100 assembler called DSL/TSS. It
had dynamic linking, a Runoff-type text formatting capability, and networking
to a bunch of microcomputers--mostly 8080s. The only thing we did not have
was any languages--just assembler. As a consequence the system ran like a
bat out of hell, and we never really noticed the fact that we only had 32K
of memory. This was around 1975-76. Now for my absolutely true story:
In order to receive my bachelor's degree, I was assigned the job of completing
Dave Presotto's bachelor's thesis. (I can say this now because Dave is
currently wildly successful at Bell Labs.) The mission: to build a micro out
of the new 6800 processor from Motorola and create software development tools
for the 6800 which would run under DSL/TSS. I wrote tens of thousands of
lines of HP2100 assembler, including a cross assembler for the 6800. But the
thesis deadline was approaching. In one all-nighter, I typed into the
DecWriter II some 1500 lines of assembler which constituted the expression
evaluator for the cross assembler. I assembled it and ran my test. The
entire expression evaluator worked perfectly on the very first attempt, except
that it bombed out when attempting to return to the calling routine. It turned
out that I had JMP'ed to the evaluator instead of JSR'ing. Since the JMP did
not deposit the return address in the entry point, the JMP indirect through the
entry point (which is how you do a return on a 2100) got a return address of
zero.
Ah, for the good old days of ASR 33s and punched tape. Sigh...
Someone once told me that the 2100 architecture was the result of someone's
master's thesis at Stanford. I have no idea who.
Dave Emberson (d...@sun.com)
A naive (and not rhetorical) question: what evidence is there to indicate the
degree to which "narrowing the semantic gap" with capability machines and the
like would improve the productivity of programmers or the reliability of
programs, and to which other techniques (e.g., making a very fast conventional
machine, possibly but not necessarily using RISC technology, and using that
speed to improve the programming environment with better software) achieve the
same goal?
The main point I'm trying to make is that the PDP-10 gives us the option
of storing the return PC in memory, in an accumulator, or on the stack,
allowing the programmer to choose whichever option is best suited to the
task at hand. Having a small impure section of the program is reasonable
under certain circumstances.
--
+-----------------------------------------------------------------------------+
| TYMNET: JMS@F29 UUCP: {ames|pyramid}oliveb!tymix!antares!jms |
| INTERNET: JMS%F29.T...@Office-1.ARPA PHONE: Joe Smith @ (408)922-6220 |
+-----------------------------------------------------------------------------+
I have located errors in programs which would have been difficult to find
otherwise.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)
Some protection and tracing features are MUCH slower in software.
These features are also useful in writing wonderful debuggers. You
need a bus snooper or ICE with an 8088 to achieve debugger features
that can be done in software on the 68020 or 80386. And per-task
memory protection is a big win in any environment, but impossible in
software.
Peter Desnoyers
pe...@athena.mit.edu
Peter Desnoyers
pe...@athena.mit.edu
eddie!athena!peter (I think)
The original HP2116 was designed by a company that was bought by HP, I believe.
(Digital something-or-other, but not "Equipment Corp").
Its instruction set is a direct ripoff, in almost EVERY respect, of the PDP8.
The instruction word was extended to 16 bits, which gave an extra bit for
op-code, an extra bit for accumulator specification (it had 2), and 2 extra
bits for address offset. Like the PDP-8, it had no subtract instruction. You
had to negate and add. Like the PDP-8 it had an operate instruction. Unlike
the PDP8, it had two separate, identical shift fields, instead of a bit to
say shift twice. The I/O system is pretty similar. And so on...
The original 2116 was built with CML logic, a kind of crude ECL. The 2116
ran very fast for its day; its successors were not as fast as the original.
The packaging was amazing- it could take a LOT of punishment, and keep on
ticking. This is part of the reason HP has a reputation for reliability.
The I/O system on the HP21xx was fairly simple; read or write, and strobe a
bit or two while you were at it. The I/O addresses were pre-decoded at each
slot; a six bit I/O address went through two 3->8 decoders, and these decoded
outputs were distributed to each card in an x-y matrix fashion, so an 'and'
gate on each card was sufficient for address recognition. Actually, two sets
of decoded addresses were distributed, so that a card could recognize two
consecutive addresses. This is so that complex I/O devices could occupy two
slots, but have the same address. This pre-decoded I/O scheme is what led to
the Apple II I/O structure.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
OK, maybe I didn't make myself clear. I'm not referring to features you find
on many random processors or systems out there, such as memory management units
or "trap on transfer of control/trap on reference to particular location"
breakpoints. I'm referring to the sort of architectural features you find in
machines that, say, require references to objects to go through a "descriptor"
for the object, to do bounds checking and the like.
One instantiation of the question would be "are you better off doing that or
having a fast machine plus a compiler that can e.g. generate bounds checking
code but avoid doing it in some cases where it's 'known' that you don't need
it?" For instance, don't bother doing it in
double a[SIZE], b[SIZE], c[SIZE];
for (i = 0; i < SIZE; i++)
    a[i] = b[i] + c[i];
(or the FORTRAN, or Ada, or COBOL, or... equivalent), since you know "i" will
always be within the bounds of all the arrays.
Not "do you think you'd be better off" or "do you have an analysis that makes
it 'intuitively obvious' that you'd be better off", but "has anybody compared
actual implementations of the two approaches, and concluded that, with
everything taken into account, you're better off with one or the other?" I.e.,
taking into account the fact that microcode and hardware is like software in
that it is possible for it to be buggy (either designed incorrectly or
implemented incorrectly), and taking into account the time it takes to design
that system, and....
On the Blit hardware? I doubt it. In any case, if the implementation
uses the full memory bandwidth (a more precise form of my earlier statement),
then it *is* fundamentally impossible to do better. The ways around this
generally involve structuring the hardware in such a way that the hardware
has access to memory bandwidth that cannot be used by the software. Other
things being equal (which they often aren't, of course), one may question
whether this is really such a smart idea.
>... What about a CPU with a 16 bit bus competing with a blit
>chip with 64 bit access to the frame buffer?
One wonders whether it might be a better investment to put in a wider CPU,
which should speed up *everything*, not just BitBlt.
>What about a CPU that must
>fetch instructions on the same bus used for data?
What about it? Most modern CPUs have ways around this on at least a small
scale. The Blit CPU is a lousy old 68000, not even a 68010; by using fun
instructions like move-multiple, one can build code (for the usual BitBlt
cases) that needs instruction fetches relatively rarely.
I agree that it's possible to break the hardware in such a way that a
software BitBlt is inherently slow. I prefer unbroken hardware myself.
I wouldn't be surprised if DEC is still making PDP-8s, which did things
that way.
First, that's only true if either the bitblt is being done to *and* from
memory not accessed by the CPU, or the memory bandwidth is high enough to
handle both the CPU and the bitblt DMA cycles. I'd guess that's rare; we
seem to be afflicted with memory-bound CPUs in a lot of systems.
But more importantly, in many cases you can't do anything else because
either (a) there's only one process running or (b) you can't afford the
context-switch time. Unless you get a bitblt operation that takes longer
than *two* context switches (out and back), it costs more. For the case
where you're bitblting characters (probably the most common case), you will
lose big.
> If dynamic-compilation is that much faster, then you would have expected
> someone to have written a program that uses this technique on the Amiga,
> rather than using the blitter. So far, I haven't seen any...
Ummm...but what does this prove? Not much, I think, until someone actually
tries it and finds out whether it's slower. Given that the Amiga already
has a useful bitblt (of whatever form), there's no great incentive to write
another.
As far as I know, there is no evidence that you would necessarily
find compelling, but then I could say that same thing about almost
everything else in this field, too. There are, on the other hand,
some good reasons to believe that we can do better than imperative
languages running on essentially unprotected architectures.
I know I can't do this topic justice in this forum, but here's my
quick take on the subject. Think about the different computer
languages you have used, and what was good or bad about each. Backus
argued (in his famous paper on functional languages) that one of the
reasons that Lisp is a good language is that you don't have to
mentally execute a program (as you do with imperative languages) in
order to convince yourself that it will do what is wanted. The
language allows you to more closely express what is desired, so you
test correctness by inspection. And yes, there are cases where it's
more obvious/easier-to-express something in C than Lisp. But both
examples serve to illustrate the point -- you, as a programmer, have
some virtual entity you want to realize (data structure or
algorithm), and the closer you can get to realizing exactly what you
have in mind, the more likely the code is to be correct,
maintainable, and understandable (and the smaller the semantic gap
between what you want and what you can express).
That's partly an allegory. The capability people argue that the same
thing extends into all areas of computer systems. Recall the classic
arguments about turning off runtime bounds checking to reclaim that
lost performance -- why should a programmer, in the best of all
possible worlds, have to worry about things like that? It doesn't
help any of the major productivity metrics.
Say what you want about Ada on the 432 (and I've said plenty
already), as a programming environment (forget the speed problems
temporarily) it was really nice. Your program had access to only the
data structures you specified, and only in the ways you specified,
and if your code screwed up, the machine could tell you where and in
what way. To me, the hardest bugs to find are those where you fix
one bug (in someone else's code, of course) and inadvertently break
it in some obscure way such that something completely unrelated is
getting trashed. For this to even be possible (in my opinion) means
that the programming environment is lacking something crucial,
notwithstanding that about 95% of all programmers on this planet see
just such an environment. The environment is failing to express what
the programmer wanted, and it's a combined failure of the machine
architecture, the language, and probably the OS too. The semantic
gap argument says that the more the desired and achieved environments
differ, the larger the quantity of bugs, and the worse their
obscurity will be.
I know you're tempted at this point to say that even if one grants
this as a shortcoming of current architectures/environments, there
have been no successes so far at addressing it. That's another topic
of conversation, I think; all I'm trying to do here is offer some of
the justification for why a lot of research has gone into other ways
to compute (that don't have sheer instrs-executed-per-unit-time as
their primary figure of merit).
All the strong-type-checking stuff built into some languages has the
same motivation as above.
For me, the bottom line is that, as usual, there aren't any easy
answers (because if there were somebody would've patented them by
now), but we shouldn't lose track of the corresponding questions just
on that basis. The problem is getting worse, after all -- more, not
fewer, lives currently depend on the correctness of software than
ever before, and that trend will continue (fly-by-computer airplanes,
dynamically-unstable planes, military systems, ...).
I assume that the reason people like you and me are concentrating on
performance is that that's what sells. I don't think that needs
defending, but I also see few reasons to believe that sheer
performance alone will provide the cures to the problems I tried to
outline above.
Uh, using what for memory bandwidth? A good implementation of BitBlt,
whether in software *or* hardware, will run the memory flat out, leaving
nothing for "servicing other processes". (Caches help a lot on instruction
fetches, but data has this annoying tendency to require real memory fetches.)
If your CPU is sitting there waiting for memory, it might as well be doing
the BitBlt itself.
>...the Amiga blitter can do
>logical operations (on three inputs) just as fast as a normal memory
>copy - It'd be pretty hard to hack some 680x0 code to XOR three inputs
>and write to three outputs as fast as it can move memory.
True, but who cares? The overwhelmingly common cases of BitBlt are doing
an operation like writing a character, where the software overhead of
deciding what to do totally dominates doing it, and bulk movement of bits,
where a straight copy operation can run memory-bound on a well-built box
with any good implementation.
>Also, the
>overhead is very low compared to the overhead involved in dynamic compilation.
See previous paragraph. The Blit code does not do dynamic compilation for
the small cases where overhead is dominant anyway (when you're drawing a
character, you are only moving a few words of data, and figuring out which
words to move is much harder than actually doing it, regardless of whether
the actual doing is hardware or software), only for the big ones that can
amortize the overhead well. BTW, the overhead is smaller than you think.
>If dynamic-compilation is that much faster, then you would have expected
>someone to have written a program that uses this technique on the Amiga,
Yes. Hence my harsh words about the competence of the people involved.
Note that doing a really good software BitBlt is A LOT OF WORK -- it's not
the sort of thing an amateur knocks off in an evening. The techniques are
not that well known (especially in the micro community, which is notorious
for its ignorance of the state of the art -- the people at Commodore most
likely had no idea it was even possible to do a fast software BitBlt).
Note that some of the earlier Suns had quite a bit of hardware BitBlt
assist, and the newer ones don't. Sun learned. Commodore will, someday.
You are asking for way too much, Guy. Computer system designers
can't even agree on performance metrics, and performance is probably
the easiest thing to measure about a computer system (as compared to,
say, reliability or programmer productivity).
Even the units in which programmer productivity is usually measured
are suspect -- lines-of-debugged-code per person per unit time????
My intuition is that the practical answer is obvious -- if you have a
machine that is 2x the performance of mine, the prospect for my
environment being more "productive" than yours (without even being
able to quantify the difference) isn't going to help me sell very
many machines. Since this is a user-perception problem, technical
answers won't necessarily help, either.
But my intuition is also that, for the same reasons that we
instinctively reach for the best tools for the job at hand (C,
assembly, Lisp, Ada, etc.), we would do considerably better as
programmers if the "thing" (including architecture, language, & OS)
executing our code matched our expectations as closely as possible,
and with few surprises. And further, if it made sure we knew the
nature of whatever surprises occurred, rather than (a C example)
having an array-pointer-out-of-bounds happen to trash a scalar value
that was coincidentally declared nearby. I don't see higher
performance as a very good means of preventing errors like that.
I think the reason you won't find many academic or formal studies
attempting to quantify the benefits of object-orientation (or other
higher-productivity programming environments) is that such studies
would be fraught with difficulty and expense, and would be unlikely
to yield completely unambiguous results. Hell, there are still
people who think assembly language programming is better; how would
you prove to them they're wrong? (If they smoke, too, you'll know
you're in big trouble if all you've got is a rational argument on
your side.)
I won't comment on Commodore's software guys (I'm typing this on an
Amiga, so, depending on when I last found a bug in the OS, I'm either
in complete agreement or violent disagreement), but there's one big
advantage of a bitblt chip over even a tight assembly-language
memory-move loop (using DBcc, no less): instruction fetch.
On newer processors such as the 68020 and 68030, instruction fetch
during a loop is no problem, since such a small tight loop is almost
certain to be in the cache. But on older processors such as the 68000
or 68010, you would burn up 3/4ths of your bus bandwidth just fetching
instructions! (no cache).
Also note that even fetching instructions out of cache, the 68020, at
least, still takes some time to execute them (presumably, the 68030's
pipelining will keep instruction fetch & execution overlapped). The
hardware blitter, of course, will never have instruction fetches...
Given the technology of 1983-1984, a hardware blit chip made a lot of
sense. Now... well, the 68030 costs a bundle, so why pay more
for a blit chip that can't go any faster anyhow?
--
Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191 Lafayette, LA 70509
PC Pursuit, n., a service specializing in mangling your postings,
by eating characters so you can't see what you're typing.
That's exactly what the Amiga does. There is 512K of RAM dedicated to
screen operations, while the OS and (with extended RAM) user programs
run out of RAM or ROM which is on a separate bus inaccessible to the
blitter. Thus, while the blitter runs, user programs are free to run
without bus contention. In addition, the blitter takes advantage of
the fact that the 68000 only uses half the available memory bandwidth,
and even when the 68000 is accessing "chip" RAM (blitter-accessible
RAM), the blitter can still run at half-speed until the 68000 goes
somewhere else. All in all, a very useful and elegant design, given
its age ('83-'84 or thereabouts), and the low cost of implementation
($650 for a complete machine, with disk drive, whereas 68020's and
68030's would cost almost the same as the complete A-500 after
production costs and two levels of markups).
> context-switch time. Unless you get a bitblt operation that takes longer
> than *two* context switches (out and back), it costs more. For the case
> where you're bitblting characters (probably the most common case), you will
> lose big.
That's one of the major flaws that they found in the Amiga's handling
of text i/o... Charlie Heath of Microsmiths replaced the bitblt with
plain old memory moves, and it is almost twice as fast.
However, I still would not want to use a plain old 68000 machine which
had to use a software loop to scroll a 32K bitmap.... I've seen Macs
and ST's, which have much simpler hardware and software which should
be faster than the Amiga's message-passing OS (which requires a
context switch to send the text to the console.device process, then
another context switch to get back to the user process)... and the
speed of window management and scrolling simply cannot compare. That
hardware makes more difference than you'd think, given all the other
factors involved.
--
Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191 Lafayette, LA 70509
PC Pursuit: A diabolical conspiracy to cause spelling flames upon the
net, by rendering the typer unable to read what he has written.
What *was* interesting was that some folks at Stanford got sick of the
stackless operation, so they added one! Made things a lot nicer. Too bad
it didn't get into the 'real' product.
--
OOO __| ___ Peter Holzmann, Octopus Enterprises
OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014
OOOOO \___/ UUCP: {hpda,pyramid}!octopus!pete
___| \_____ Phone: 408/996-7746
I know of no programmer who thinks assembly language is better for all
kinds of programs, nor do I know a good one who feels it is *never*
better for *any* job. Until I can buy a compiler -- for any language --
that can generate code that runs as fast as I can write by hand, I will
use assembly language where speed is essential. You will, indeed, have
a hard time proving to me that I'm wrong in using the tool that best
fits the requirements of the job.
--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nat...@astro.as.utexas.edu
My counterargument is that it is almost as much of an imposition on the
programmer to have such checks done at runtime as it is to have them not
done at all. Given the impossibility of complete testing, there is no
way to guarantee such tests will catch a bug during development. This
means that the programmer has to go through the same complicated exercise
of trying to assure himself that the problem will never happen. What is
really wanted is a way to check these properties at COMPILE TIME... in
which case the runtime checks become largely superfluous anyway.
Can it be done? Well, in one sense the answer is clearly yes, because a
proof of program correctness has to include it, and we know that automating
such proofs is possible (although seldom practical at the moment). The
question is whether it can be done with practical, affordable machinery
without crippling the language. My own long-held conjecture is that the
answer is "yes", but I don't have proof of that yet.
Let us not forget, also, that self-modifying code wasn't uncommon in
bootstraps, back in the days when ROMs were expensive or unavailable.
The disk bootstrap for a late-model PDP-8 was two instructions, so
brief that nobody ever bothered putting it in ROM, and it was about
as self-modifying as you could get: power-up-reset happened to leave
the disk controller in just the right state for reading block 0 into
memory starting at 0, so the bootstrap -- toggled in at a specific place
in low core -- was "kick the disk controller; branch to self". It
got overwritten as block 0 came in, and of course block 0 was carefully
crafted to catch control as the "branch to self" was overwritten!
Just because you don't know any doesn't mean they don't exist. Look
back about four years ago in all the standard trade rags and you'll
find a letters-to-the-editor war waged on this basis. Directly
equivalent to the "go-tos suck" war in CACM several months back.
Anyway, that ain't what I'm talking about.
I'm trying to draw what seems to me to be a direct analogy. OK, it
won't work on you if you don't know any assembly language religious
fanatics (and I grant there aren't many left). What I'm saying is
that no reasonable proof can be offered to cause someone to switch
away from the tools they're using, and I think the same phenomenon
applies to programming environments in the large.
Heck, for all I know, your situation might be a case in point.
Suppose you go to your boss (pretend you're in a commercial situation
to make this cleaner) and tell him "hey, I can meet the target specs
you asked for by coding in C over a period of 2 months, but I can
exceed them by 15% by coding some of it in assembler (and it'll take
a little longer)". Your job might be to get the initial prototype
running, but his is to make sure he ends up with a product that he
can maintain, extend, and understand, because you, the programmer,
are statistically likely to leave within 2 or 3 years. So he may
tell you to punt the extra performance. The issue here is on what
grounds, and in what terms, the two of you can discuss which path to
take. It's not as simple as "hand assembly code is faster than
compiled code".
We probably agree that the compiler should check as much as it
possibly can; the earlier an error is detected, the less the
ambiguity about what is wrong and the cheaper the cure. But there is
an awful lot the compiler can't know about -- programs that use input
data, generate pointers, or do just about anything else that's
interesting, are going to do many operations that are difficult for
the compiler to validate at compile time.
I'm not sure I catch your drift on the imposition of runtime checks.
To me, the best human debuggers have the ability to figure out
fastest which wrong assumptions are being made, thus getting
to the heart of a problem quickest. If I have a program bug,
I want the machine to express as directly as possible the intent of
my program so that I can narrow the bug-space down to my code alone.
If the machine allows a change to one variable (an array, say) to
affect some other unrelated variable, then it is not conforming to
the intent of my program. In fact, it is not conforming to anything
useful at all, since nobody would argue that it is useful programming
practice to ever do such a thing on purpose (I hope?). Given that,
the fact that the machine can do such a thing, let alone do it as a
default, is what I'd label a shortcoming in the basic architecture.
One we've all long ago learned to live with, to be sure, but it's not
something to be proud of, at very least.
>Can it be done? Well, in one sense the answer is clearly yes, because a
>proof of program correctness has to include it, and we know that automating
>such proofs is possible (although seldom practical at the moment). The
>question is whether it can be done with practical, affordable machinery
>without crippling the language. My own long-held conjecture is that the
>answer is "yes", but I don't have proof of that yet.
I sure hope you're right. In fact, if progress towards this goal
becomes evident, I'd propose we start discussing ways of turning
architecture towards better ways of supporting this instead of
throughput or programming environments.
But the 68000 (uncached) version would use 3/4ths of that bandwidth
to fetch INSTRUCTIONS. Also note that the bandwidth of modern 100ns
memories is much greater than an 8 MHz 68000 can take advantage of...
even when the 68000 is on the bus 100% of the time that it can be, you
still have half that memory bandwidth left for blitter operations etc.
(a fact that the designers of the ST use to do their video refresh).
Given the cost constraints, and the state of technology in 1984, it
made a lot of sense to have a hardware blitter. The fact that the Blit
terminal did fast blits without a blitter probably has more to do with
the fact that a) the machine was designed before modern high-speed
DRAMs, when 250ns RAMs were fast, and b) the machine had a much higher
resolution screen than the Amiga, meaning that much more bus bandwidth
was taken up with video refresh (thereby improving the desirability of
caching the loop, since you couldn't do data accesses any faster
anyhow).
> If your CPU is sitting there waiting for memory, it might as well be doing
> the BitBlt itself.
At least with the 68000 and modern DRAMs, the memory spends most of
its time waiting for the CPU!
> deciding what to do totally dominates doing it, and bulk movement of bits,
> where a straight copy operation can run memory-bound on a well-built box
> with any good implementation.
Memory-bound? Yes, if you're using a 68020 or 68030 with a tight loop
that'll fit into their internal cache. But the 68000 would spend half
of that bandwidth simply fetching instructions, not moving bits....
> the sort of thing an amateur knocks off in an evening. The techniques are
> not that well known (especially in the micro community, which is notorious
> for its ignorance of the state of the art -- the people at Commodore most
> likely had no idea it was even possible to do a fast software BitBlt).
True, the majority of the micro world thinks multitasking is a neat
idea, message passing is what secretaries do, etc. But this is the first
time I ever heard the designers of the Amiga so accused (the Amiga, BTW,
was NOT designed by Commodore). Considering that they implemented such nifty
state-of-the-art ideas as a message-passing multi-tasking OS, etc. (only
to ruin it all by commissioning a cruddy filesystem), and considering
that they all had Suns on their desktops when they were designing it,
it sounds like you're being harsh for no good reason, considering the
design constraints (512K RAM, under-$1500 price, etc.).
> Note that some of the earlier Suns had quite a bit of hardware BitBlt
> assist, and the newer ones don't. Sun learned. Commodore will, someday.
Earlier Suns were based upon 68000/68010 technology, where hardware
bitblts make a lot of sense. Later Suns, based upon 68020s, don't
need the hardware assist (see the beginning of this message).
Unfortunately, Commodore probably will retain their blitter even when
they move into 68020/68030 machines, for backward-compatibility
reasons (there's a helluva lot of programs out there, mostly games,
directly accessing that blitter). BTW, games programmers are known for
their dedication to speed... and most of the games programmers that I
know have made extensive use of the blitter whenever it made sense
(when moving large amounts of data, when setup time is no longer the
dominating factor).
--
Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191 Lafayette, LA 70509
PC Pursuit: Or, How to Eat the Typist's Echo in Three Words or Less.
I used to work on a CDC 170 state mainframe, and got to be pretty good at
writing assembly language programs for it (natural skill [blush] plus a very
simple but powerful instruction set made for fun programming). A friend and
superior (Marc Furon; I knew him before I started working with him) was
*very* good with it, definitely one of the better COMPASS programmers around.
He maintained, and I believe him (hey, I trust the guy's opinion), that the
FORTRAN compiler could generate better code than he could, sometimes.
'Course, the MSC 5.x compiler will also generate better code than I will, at
times (other times, it's so *stupid* 8-)). Compiler technology is getting
*very* good, and will, for a time, continue to improve. I think if you wait
a few years (or decades), then we will probably have a compiler that won't
need the noalias keyword: it will be able to figure it out from
comprehensive code analysis and a good understanding of the machine.
Fun thought, ain't it?
--
Sean Eric Fagan | "An Amdahl mainframe is lots faster than a Vax 750."
se...@sco.UUCP | -- Charles Simmons (ch...@amdahl.uts.amdahl.com)
(408) 458-1422 | Any opinions expressed are my own, not my employers'.
The answer is bandwidth. The CPU does not have to stop filling the instruction
pipeline when it accesses/writes data.
--
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge
I fear that the assembler jock would have done the innermost loop
first, said "look how much better assembler is than C," and gone on.
Obviously the Xenix/286 compiler doesn't generate optimal 80287 code,
but I never would have been willing to spend time on the algorithm if
changes took an hour of typing to enter.
--
bill davidsen (we...@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me