
reducing number consing


Barry Margolin

Dec 14, 2003, 2:02:12 AM
In article <v5PCb.11119$Ug6...@twister.nyroc.rr.com>,
Cliff Crawford <cj...@cornell.edu> wrote:

> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function. According to SBCL's profiler, the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible. Does anyone see what's
> wrong with it? TIA...

There's probably some boxing and unboxing of floats going on. Maybe
declaring the type of SUM would allow the computations in the inner loop
to be open-coded.

> (deftype net-vector () `(simple-array double-float))
> (deftype net-matrix () `(simple-array double-float 2))
> (deftype net-activation-function () `(member :linear :logistic :exp :tanh))
>
> (defun feed-forward (layer1 layer2 weights biases activation-function)
>   (declare (optimize (speed 3) (safety 0))
>            (net-vector layer1 layer2 biases)
>            (net-matrix weights)
>            (net-activation-function activation-function))
>   (loop :for i :below (length layer2) :do
>     (let ((input (+ (aref biases i)
>                     (loop :with sum = 0d0
>                           :for j :below (length layer1) :do
>                           (incf sum (* (aref weights i j)
>                                        (aref layer1 j)))
>                           :finally (return sum)))))
>       (setf (aref layer2 i) (case activation-function
>                               (:linear input)
>                               (:logistic (logistic input))
>                               (:exp (exp input))
>                               (:tanh (tanh input)))))))
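
(A sketch of the declaration being suggested, not from the original post,
pulled out into a stand-alone function so it compiles on its own; in
FEED-FORWARD itself the same OF-TYPE declaration would go on the inner
LOOP's SUM, since a separate helper like this would box its own
double-float return value.)

(defun weighted-sum (weights layer1 i)
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array double-float (* *)) weights)
           (type (simple-array double-float (*)) layer1)
           (type fixnum i))
  (loop :with sum :of-type double-float = 0d0
        :for j :of-type fixnum :below (length layer1)
        :do (incf sum (* (aref weights i j) (aref layer1 j)))
        :finally (return sum)))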

--
Barry Margolin, bar...@alum.mit.edu
Woburn, MA

Alexey Dejneka

Dec 14, 2003, 2:00:46 AM
Cliff Crawford <cj...@cornell.edu> writes:

> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function. According to SBCL's profiler,

Implementation-specific questions are better asked on its
mailing-list.

> the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible. Does anyone see what's
> wrong with it? TIA...
>

> (deftype net-vector () `(simple-array double-float))
> (deftype net-matrix () `(simple-array double-float 2))
> (deftype net-activation-function () `(member :linear :logistic :exp :tanh))
>
> (defun feed-forward (layer1 layer2 weights biases activation-function)
>   (declare (optimize (speed 3) (safety 0))
>            (net-vector layer1 layer2 biases)
>            (net-matrix weights)
>            (net-activation-function activation-function))
>   (loop :for i :below (length layer2) :do
>     (let ((input (+ (aref biases i)
>                     (loop :with sum = 0d0
>                           :for j :below (length layer1) :do
>                           (incf sum (* (aref weights i j)
>                                        (aref layer1 j)))
>                           :finally (return sum)))))
>       (setf (aref layer2 i) (case activation-function
>                               (:linear input)
>                               (:logistic (logistic input))
>                               (:exp (exp input))
>                               (:tanh (tanh input)))))))

Put (DECLAIM (INLINE LOGISTIC)) before the definition of LOGISTIC. You
/may/ also need to slightly change FEED-FORWARD:

(defun feed-forward (layer1 layer2 weights biases activation-function)
  (declare (optimize (speed 3) (safety 0))
           (net-vector layer1 layer2 biases)
           (net-matrix weights)
           #+nil (net-activation-function activation-function) ; <---
           )
  (loop :for i :below (length layer2) :do
    (let ((input (+ (aref biases i)
                    (loop :with sum = 0d0
                          :for j :below (length layer1) :do
                          (incf sum (* (aref weights i j)
                                       (aref layer1 j)))
                          :finally (return sum)))))
      (setf (aref layer2 i) (case activation-function
                              (:linear input)
                              (:logistic (logistic input))
                              (:exp (exp input))
                              (:tanh (tanh input))
                              (t (error-not-activation-function input)) ; <---
                              )))))
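
(For reference, a guess at what an inlinable LOGISTIC might look like; the
original definition was not posted, so this is only a sketch.)

(declaim (inline logistic))
(defun logistic (x)
  (declare (type double-float x)
           (optimize (speed 3) (safety 0)))
  (/ 1d0 (+ 1d0 (exp (- x)))))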

--
Regards,
Alexey Dejneka

"Alas, the spheres of truth are less transparent than those of
illusion." -- L.E.J. Brouwer

james anderson

Dec 14, 2003, 7:10:34 AM

Cliff Crawford wrote:
>
> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function. According to SBCL's profiler, the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible. Does anyone see what's
> wrong with it? TIA...

the mystery for this reader is why it should cons only 32 bytes each time it
is called, where the arrays are of unspecified bounds. this would imply that
the external calls on input are already inlined and do not cons, and that the
array references also do not cons, and suggests that it is consing
temporaries. 32 bytes at 16 per double would imply two values. perhaps input
and sum. should that be true, perhaps despite the compiler's evident
perspicacity, one needs to inform it that these two variables have dynamic
extent. in whatever manner it needs to be told such things. in some
implementations it would look like:

(defun feed-forward-o (layer1 layer2 weights biases activation-function)
  (declare (optimize (speed 3) (safety 0))
           (net-vector layer1 layer2 biases)
           (net-matrix weights)
           (net-activation-function activation-function))
  (let ((input 0.0d0)
        (sum 0.0d0))
    (declare (type double-float input sum)
             (dynamic-extent input sum))
    (loop :for i :below (length layer2) :do
      (setf sum 0.0d0)
      (dotimes (j (length layer1))
        (setf sum (+ sum (* (aref weights i j) (aref layer1 j)))))
      (setf input (+ (aref biases i) sum))
      (setf (aref layer2 i) (case activation-function
                              (:linear input)
                              (:logistic (logistic input))
                              (:exp (exp input))
                              (:tanh (tanh input)))))))

in mcl these declarations eliminate the consing for the temporaries. sbcl must
have something similar.

nb. i'm loop impaired and can't suggest the formulation for such expressions

...

Tim Bradshaw

Dec 14, 2003, 10:30:53 AM
* Cliff Crawford wrote:
> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function. According to SBCL's profiler, the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible. Does anyone see what's
> wrong with it? TIA...

It is probably boxing a float for the return value (SUM). CMUCL
people will know better how to avoid this, but I think the answer is
block compilation, inlining, and/or aggressive ahead-of-time
declaration of the function type.
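
(As a sketch, and assuming the boxing happens around calls to LOGISTIC, such
an ahead-of-time declaration might be:)

(declaim (ftype (function (double-float) double-float) logistic))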

--tim

Joe Marshall

Dec 15, 2003, 10:27:55 AM
Cliff Crawford <cj...@cornell.edu> writes:

> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function. According to SBCL's profiler, the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible.

Why? Is GC an issue here?

Cliff Crawford

Dec 17, 2003, 3:46:08 PM

I'm worried it will be, because in typical usage it will get called
100,000 - 1,000,000 times. It's certainly fast enough for my test
cases, though. Anyway, I'll go ask on the SBCL mailing list (I didn't
realize this would be an implementation-specific thing). Thanks for
all the responses.


--
Cliff Crawford *** cj...@cornell.edu

"Platypus? I thought it was pronounced platymapus. Has it always been
pronounced platypus?" -- Jessica Simpson

Raymond Toy

Dec 17, 2003, 5:08:47 PM
>>>>> "Cliff" == Cliff Crawford <cj...@cornell.edu> writes:

Cliff> On 2003-12-15, Joe Marshall <j...@ccs.neu.edu> wrote:
Cliff> | Cliff Crawford <cj...@cornell.edu> writes:
Cliff> |
Cliff> | > I'm trying to write a simple neural network simulation in Lisp, but
Cliff> | > I'm having trouble with number consing in the inner loop in the
Cliff> | > training function. According to SBCL's profiler, the function
Cliff> | > feed-forward (see below) conses 32 bytes every time it's called, but
Cliff> | > I'd like it to not cons at all if possible.
Cliff> |
Cliff> | Why? Is GC an issue here?

Cliff> I'm worried it will be, because in typical usage it will get called
Cliff> 100,000 - 1,000,000 times. It's certainly fast enough for my test
Cliff> cases, though. Anyway, I'll go ask on the SBCL mailing list (I didn't
Cliff> realize this would be an implementation-specific thing). Thanks for
Cliff> all the responses.

Shouldn't you profile it with real stuff instead of test code before
micro-optimizing this?

Luke Gorrie was saying that in one of his ethernet switch (?)
programs, he was consing for every single ethernet packet. It turned
out that it didn't really matter, and generational GC took care of the
consing quite well.

Ray


Luke Gorrie

Dec 17, 2003, 5:49:22 PM
Raymond Toy <t...@rtp.ericsson.se> writes:

> Luke Gorrie was saying that in one of his ethernet switch (?)
> programs, he was consing for every single ethernet packet.

No need to rely on hearsay from me. TIME will say how much time was
spent in GC anyway (if you can tear your eyes away from "bytes consed" :-)
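
(For example, using the names from the original post, assuming everything
is already set up:)

(time (dotimes (i 100000)
        (feed-forward layer1 layer2 weights biases :tanh)))

The report includes run time, GC run time, and bytes consed.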

~20MB/sec of consing cost me <10% CPU, which was just fine with me.
YMMV.

Cheers,
Luke

Joe Marshall

Dec 17, 2003, 7:49:30 PM
Raymond Toy <t...@rtp.ericsson.se> writes:

> Shouldn't you profile it with real stuff instead of test code before
> micro-optimizing this?
>
> Luke Gorrie was saying that in one of his ethernet switch (?)
> programs, he was consing for every single ethernet packet. It turned
> out that it didn't really matter, and generational GC took care of the
> consing quite well.

With a good generational GC (and the appropriate tuning), you can cons
garbage at a phenomenal rate for practically nothing. One program
that I have chews through several gigabytes in a few seconds with very
little GC overhead. Basically, the generational GC kicks in, notices
that absolutely nothing points at the latest generation, and just
resets the consing frontier to the beginning.

This isn't to say that you can *always* get away with this, nor that
it is *never* worthwhile to handle memory management manually, but in
my experience it is almost never worthwhile worrying about consing
stuff that will be immediately thrown away.

Incidentally, if you are consing stuff that you are *not* throwing
away, then you will want to be concerned about the GC. A generational
GC will end up promoting the structures through the various
generations unnecessarily.


--
~jrm

William D Clinger

Dec 19, 2003, 3:44:01 PM
Cliff Crawford <cj...@cornell.edu> wrote:
> | Why? Is GC an issue here?
>
> I'm worried it will be, because in typical usage it will get called
> 100,000 - 1,000,000 times.

At 32 bytes per call, that's 32 megabytes of short-lived data.
In all likelihood, you're worrying about less than one second
of gc time, probably much less.

Will

Luke Gorrie

Dec 19, 2003, 4:51:00 PM

While we're being quantitative, in a typical SBCL setup this would
require about three garbage collections (one for each 12MB allocated),
and each would take something like 10ms on a 2Ghz pentium PC if the
data is very short-lived. So "probably less than one second" is
putting it mildly.

But if the urge to optimise memory management is strong, we're in
luck, because it looks like the garbage collector could be optimised
considerably further for these cases. And what could be higher in
optimisation-glory than hacking the GC?

I must disclaim that probably very few programs would benefit from
this, and one should check with TIME to see. But such optimisation should
benefit programs that generate a truly phenomenal amount of garbage,
or programs that don't work well with frequent pauses of ~10ms.

Joe Marshall's nice description was:

Basically, the generational GC kicks in, notices that absolutely
nothing points at the latest generation, and just resets the consing
frontier to the beginning.

But on CMUCL or SBCL, it seems to be more like this:

The generational GC kicks in, notices that nothing in older
generations points at the latest generation, scavenges static space
to update any pointers into the latest generation, and then just
sets the consing frontier to the beginning.

That is to say, the GC doesn't know whether anything in "static" space
(where non-movable objects can be allocated) has been modified, so
during each collection it "scavenges" that memory to update pointers
to moved objects. The other dynamic-space generations have a
"write-barrier" to keep track of what has and hasn't been modified,
but the static space does not.

Static space appears to typically be around 4MB, and a significant
amount of time seems to be spent scavenging it (based on profiling).

Ideas have been floated about fixing this, for instance moving most of
the objects in static space into a "tenured" generation of dynamic
space (where it gets the write-barrier) that doesn't get
collected. Whoever rolls up their sleeves and hacks it will have very
good karma indeed :-)

More details can be found in the cmucl-imp mailing list archives.

It also appears that on some systems (i.e. my laptop) the garbage
collector causes a great deal of kernel work. Smarter use of system
calls may fix this, e.g. coalescing mprotect()s of adjacent pages. You
can see if your system suffers this problem by running this in the
top-level:

(defun foo () (make-array 1024))
(compile 'foo)
(time (dotimes (i (expt 10 6)) (foo)))

On my laptop this reports that most of the time was spent in the
system (i.e. kernel), but on my desktop machine it doesn't. Both are
running Linux/x86, but different kernels and different CMUCL versions.

Cheers,
Luke

Joe Marshall

Dec 19, 2003, 6:30:49 PM
Luke Gorrie <lu...@bluetail.com> writes:

> It also appears that on some systems (i.e. my laptop) the garbage
> collector causes a great deal of kernel work. Smarter use of system
> calls may fix this, e.g. coalescing mprotect()s of adjacent pages.

Some experiments have shown that checking the write-barrier in
software by doing a conditional branch on every write can outperform
hardware checking based on mprotect.

--
~jrm

Luke Gorrie

Dec 19, 2003, 7:10:59 PM
Joe Marshall <prunes...@comcast.net> writes:

> Some experiments have shown that checking the write-barrier in
> software by doing a conditional branch on every write can outperform
> hardware checking based on mprotect.

Interesting. Do you have any more details?

In particular I wonder what the bottleneck was with hardware checking.

-Luke

Michael Livshin

Dec 19, 2003, 7:32:23 PM
Joe Marshall <prunes...@comcast.net> writes:

> Luke Gorrie <lu...@bluetail.com> writes:
>
> Some experiments have shown that checking the write-barrier in
> software by doing a conditional branch on every write can outperform
> hardware checking based on mprotect.

that's very system-dependent. conventional wisdom holds, for
instance, that on Linux signals (and context switches in general) are
very fast, so there the mprotect-based approach can be a win. I'm not
sure how much one can trust conventional wisdom here, though. are
signals as fast on SMP systems as they are on single-CPU ones? is
this still true for reasonably current kernel/libc versions? I don't
know.

what were the experiments you talk about done on, if it's not a secret?

--
it takes more than not to remember how you did it the last time to be
innovative. -- Erik Naggum

Joe Marshall

Dec 19, 2003, 9:29:23 PM
Luke Gorrie <lu...@bluetail.com> writes:

> Joe Marshall <prunes...@comcast.net> writes:
>
>> Some experiments have shown that checking the write-barrier in
>> software by doing a conditional branch on every write can outperform
>> hardware checking based on mprotect.
>
> Interesting. Do you have any more details?

@inproceedings{ hosking92comparative,
author = "Antony L. Hosking and J. Eliot B. Moss and Darko Stefanovic",
title = "A comparative performance evaluation of write barrier implementations",
booktitle = "Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications ({OOPSLA})",
journal = "SIGPLAN Notices",
volume = "27",
number = "10",
publisher = "ACM Press",
address = "New York, NY",
editor = "Andreas Paepcke",
pages = "92--109",
year = "1992",
url = "citeseer.nj.nec.com/hosking92comparative.html" }

> In particular I wonder what the bottleneck was with hardware checking.

Really slow interrupt handling.

--
~jrm

Joe Marshall

Dec 19, 2003, 9:33:31 PM
Michael Livshin <use...@cmm.kakpryg.net> writes:

> Joe Marshall <prunes...@comcast.net> writes:
>>
>> Some experiments have shown that checking the write-barrier in
>> software by doing a conditional branch on every write can outperform
>> hardware checking based on mprotect.
>
> that's very system-dependent. conventional wisdom holds, for
> instance, that on Linux signals (and context switches in general) are
> very fast, so there the mprotect-based approach can be a win. I'm not
> sure how much one can trust conventional wisdom here, though. are
> signals as fast on SMP systems as they are on single-CPU ones? is
> this still true for reasonably current kernel/libc versions? I don't
> know.
>
> what was the experiments you talk about done on, if it's not a secret?

@inproceedings{ hosking92comparative,
author = "Antony L. Hosking and J. Eliot B. Moss and Darko Stefanovic",
title = "A comparative performance evaluation of write barrier implementations",
booktitle = "Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications ({OOPSLA})",
journal = "SIGPLAN Notices",
volume = "27",
number = "10",
publisher = "ACM Press",
address = "New York, NY",
editor = "Andreas Paepcke",
pages = "92--109",
year = "1992",
url = "citeseer.nj.nec.com/hosking92comparative.html" }

and

@techreport{ benjamin90barrier,
author = "Zorn, Benjamin",
title = "Barrier Methods for Garbage Collection",
number = "CU-CS-494-90",
year = "1990",
url = "citeseer.nj.nec.com/zorn90barrier.html" }

It would be interesting to see more recent numbers, though.


--
~jrm

Duane Rettig

Dec 19, 2003, 11:10:50 PM
Michael Livshin <use...@cmm.kakpryg.net> writes:

> Joe Marshall <prunes...@comcast.net> writes:
>
> > Luke Gorrie <lu...@bluetail.com> writes:
> >
> > Some experiments have shown that checking the write-barrier in
> > software by doing a conditional branch on every write can outperform
> > hardware checking based on mprotect.
>
> that's very system-dependent. conventional wisdom holds, for
> instance, that on Linux signals (and context switches in general) are
> very fast, so there the mprotect-based approach can be a win. I'm not
> sure how much one can trust conventional wisdom here, though. are
> signals as fast on SMP systems as they are on single-CPU ones? is
> this still true for reasonably current kernel/libc versions? I don't
> know.

It depends on what you call "fast". Is 100 - 200 cycles fast?
Probably, because it is certainly faster than you can blink.
But if you compare that with the 5-10 cycles that a software
test might take, then we're dealing with molasses...

Why does trap handling take so much longer than software handling?
Well, a trap handler (especially one which executes user code
in order to determine the behavior) must be general; it must
set up protections so that the user trap handling cannot destroy
the system by bug or by malfeasance, and it must save as much of
the current context (including any used registers) as is necessary
in order to restore the state of the program after the trap.
On the other hand, a software barrier can be optimized by the
compiler to only save those caller-saves registers that are
thus volatile, and since there is no context switch, there is
no protection change. The software can get in and out quickly.

On the third hand, I have often raised some ruckus in comp.arch
to ask them there for fast user-level trap handling as the
best way for general-purpose computer architectures to support
garbage-collected languages. If such an architecture (plus
supporting kernel) were to ever be available in a general
purpose operating system, then even if such a conditional
trap were to take 30-50 cycles, it would still be worthwhile
to pursue it to replace the 5-10 cycle software barrier, since
software barriers tend to never be zero-cost, but traps, when
not taken, do tend to be zero cost, so the challenge is to find
the best algorithm that minimizes the number of times a trap is
actually taken.


--
Duane Rettig du...@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182

Luke Gorrie

Dec 20, 2003, 3:07:06 AM
Joe Marshall <prunes...@comcast.net> writes:

> Luke Gorrie <lu...@bluetail.com> writes:
>
> > In particular I wonder what the bottleneck was with hardware checking.
>
> Really slow interrupt handling.

Raymond Toy and I have been poking around at this in CMUCL.

For my part, I have reached the following shocking conclusions:

Performance analysis is quite hard.
Microbenchmarks are not very useful.
Profilers are really wonderful.

This is based on "carefully" analysing the performance of the
following program:

(defun foo (x y) (cons x y))
(dotimes (i N) (foo 1 2))

Hopefully some tidbits of interesting, if not useful, CMUCL
information come from this:

For purely ultra-short-lived data, allocation is much more costly
than garbage collection.

The majority of time was spent in the kernel page-fault handler, but
this has nothing to do with the write-barrier. At the end of garbage
collection, the collector "zeros out" the oldspace by
munmap/mmap'ing it. The kernel implements this lazily by assigning
each page as a copy-on-write version of a reserved page of
zeroes. The first time you write to each of these pages (by
allocating a new cons) the kernel internally faults and zeroes a
page of real memory for you.

If you compile CMUCL from source, make sure you build the runtime
system with -O3 (which may not happen by default). For me this made
garbage collection almost three times faster.

Of course we all know that profilers are wonderful, but Oprofile for
Linux is really really wonderful. It is convenient to use (no
instrumentation), profiles everything on your computer at once (all
programs, shared libraries, kernel, device drivers), and causes no
noticeable overhead. If you ask it, "What's my computer doing while
CMUCL compiles itself?", it will promptly reply something like:

samples % app name symbol name
194960 73.5565 lisp.core (no symbols)
15843 5.9774 lisp scavenge
9578 3.6137 lisp from_space_p
8743 3.2986 vmlinux default_idle
5810 2.1921 vmlinux do_anonymous_page
4814 1.8163 cc1 (no symbols)
2415 0.9112 vmlinux do_page_fault

Okay, so it's a pity it can't find symbol information in the Lisp
image or stripped C programs. However, it is perfectly happy to
annotate the code (source or disassembly) of the kernel's
`do_anonymous_page' to let you see that it's spending its time writing
zeros into memory, for example.

Finally, though I've been very well brought up not to micro-optimize,
I couldn't help comparing the foo program to this bar program:

(defvar my-cons (cons nil nil))
(defun bar (x y) (setf (car my-cons) x) (setf (cdr my-cons) y) my-cons)
(dotimes (i N) (bar 1 2))

I didn't write down my performance prediction in a sealed envelope,
but I think I was expecting it to be something like 20-50 times
faster. I was surprised to find it more like 500 times. So although
I'm not about to run and rewrite all my programs without using CONS, I
won't be quite so quick to scoff the next time someone tells me that
he really had to optimize the memory management in his program to get
the performance he needed.

I'm really a Watson in search of a Holmes in these matters, so I think
I'll leave it at that for now.

Tallyho,
Luke

Pascal Bourguignon

Dec 20, 2003, 3:51:05 AM

Luke Gorrie <lu...@bluetail.com> writes:
> For purely ultra-short-lived data, allocation is much more costly
> than garbage collection.
>
> The majority of time was spent in the kernel page-fault handler, but
> this has nothing to do with the write-barrier. At the end of garbage
> collection, the collector "zeros out" the oldspace by
> munmap/mmap'ing it.

So, here you have the bug in CMUCL!
It should use memset(garbage,0,size) instead.


--
__Pascal_Bourguignon__ . * * . * .* .
http://www.informatimago.com/ . * . .*
There is no worse tyranny than to force * . . /\ ( . *
a man to pay for what he does not . . / .\ . * .
want merely because you think it .*. / * \ . .
would be good for him. -- Robert Heinlein . /* o \ .
http://www.theadvocates.org/ * '''||''' .
SCO Spam-magnet: postm...@sco.com ******************

Luke Gorrie

Dec 20, 2003, 4:39:30 AM
Luke Gorrie <lu...@bluetail.com> writes:

> Finally, though I've been very well brought up not to micro-optimize,
> I couldn't help comparing the foo program to this bar program:
>
> (defvar my-cons (cons nil nil))
> (defun bar (x y) (setf (car my-cons) x) (setf (cdr my-cons) y) my-cons)
> (dotimes (i N) (bar 1 2))
>
> I didn't write down my performance prediction in a sealed envelope,
> but I think I was expecting it to be something like 20-50 times
> faster. I was surprised to find it more like 500 times.

*Ahem*. :-)

I'm pleased to report that I botched this _completely_. You might be
well advised to ignore all that I've said, as my CMUCL source tree
appears to be rather badly broken.

I will refrain from reporting a "correction", lest I botch it up too
somehow. It's easy to measure for yourself.

Cons and be merry!

Cheers,
Luke

William D Clinger

Dec 20, 2003, 10:28:41 AM
Joe Marshall wrote:
> It would be interesting to see more recent numbers, though.

I'm too lazy to type in a bunch of references, but you can follow
the references in this paper, which itself contains a bunch of
numbers:

David Detlefs, Ross Knippel, William D Clinger, and Matthias Jacob.
Concurrent Remembered Set Refinement in Generational Garbage Collection.
In the Proceedings of the 2002 USENIX Java VM Research and Technology
Symposium, August 2002. Online as PDF or HTML:
http://research.sun.com/jtech/pubs/02-clog.pdf
http://research.sun.com/jtech/pubs/02-clog-html/clog.html

Will

Joe Marshall

Dec 20, 2003, 11:14:24 AM
Luke Gorrie <lu...@bluetail.com> writes:

> Joe Marshall <prunes...@comcast.net> writes:
>
>> Luke Gorrie <lu...@bluetail.com> writes:
>>
>> > In particular I wonder what the bottleneck was with hardware checking.
>>
>> Really slow interrupt handling.
>
> Raymond Toy and I have been poking around at this in CMUCL.
>
> For my part, I have reached the following shocking conclusions:
>
> Performance analysis is quite hard.
> Microbenchmarks are not very useful.
> Profilers are really wonderful.

I'll concur with these.

> Hopefully some tidbits of interesting, if not useful, CMUCL
> information comes from this:
>
> For purely ultra-short-lived data, allocation is much more costly
> than garbage collection.

I can believe this. I don't know the details of CMUCL, but other
systems have a lot of overhead in allocation.

Allocation from an arena shared across threads needs to be locked.
The solution here is to give each thread its own top-level
allocation arena.

Many (or most) allocators check the allocation limit at each
allocation. One possible strategy is to use a separate thread to
check the remaining room. Of course this thread had better be very
high priority. Another is to protect a page at the end of the
arena.

Many systems rely on zeroed pages for GC safety. Although this
avoids problems with the GC discovering bad data between allocating
the cell and filling it, the cell ends up being written twice.

Some systems avoid the zeroing problem by having special arenas for
certain kinds of ultra-short-lived data like floats and bignums.

Finally, the few instructions used for allocation tend to really
exercise the memory bus. Hand scheduling the instructions can
produce a noticeable benefit.
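
(A toy model of the bump-pointer-plus-limit-check pattern described above,
sketched in Lisp rather than any system's actual allocator code:)

(defstruct arena
  (store (make-array 4096) :type simple-vector)
  (free-pointer 0 :type fixnum))

(defun arena-alloc (arena nwords)
  (declare (type arena arena) (type fixnum nwords))
  (let* ((start (arena-free-pointer arena))
         (end (+ start nwords)))
    (if (> end (length (arena-store arena)))
        ;; a real allocator would take a slow path here: refill the
        ;; thread-local region from a shared arena, or trigger a GC
        (error "arena full")
        (progn (setf (arena-free-pointer arena) end)
               start))))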

> The majority of time was spent in the kernel page-fault handler, but
> this has nothing to do with the write-barrier. At the end of garbage
> collection, the collector "zeros out" the oldspace by
> munmap/mmap'ing it. The kernel implements this lazily by assigning
> each page as a copy-on-write version of a reserved page of
> zeroes. The first time you write to each of these pages (by
> allocating a new cons) the kernel internally faults and zero's a
> page of real memory for you.

If the page really needs to be zeroed, it'd probably be better to do
so in the process rather than in the kernel.

> So although
> I'm not about to run and rewrite all my programs without using CONS, I
> won't be quite so quick to scoff the next time someone tells me that
> he really had to optimize the memory management in his program to get
> the performance he needed.

Sometimes you need to optimize memory management. But it is far
better to check the performance first.

--
~jrm

Rahul Jain

Dec 21, 2003, 4:41:44 PM
Jan Rychter <j...@rychter.com> writes:

> I've recently been wondering how much of a real problem it is that most
> (all?) GCs tend to move data around in memory. This wasn't really a
> problem back in the days where memory was running at approximately same
> speed as the CPU -- but these days (on PC architectures at least) our
> memories are still below 400MHz, while our CPUs go way up into GHz. A
> single memory access (with a cache miss) can easily cost you hundreds of
> CPU instructions. Cache suddenly becomes extremely valuable and moving
> data in memory generally ruins that.

Not all GCs move the data around, but for those that do, the fact that
they actively *compact* the memory actually makes them more
cache-efficient. Also, they often move the data by following
pointers. This means that following a chain of references is likely to
involve accessing data that is spatially near.

--
Rahul Jain
rj...@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist

Richard Fateman

Dec 21, 2003, 6:42:42 PM

Jan Rychter wrote:

> Mind you, this is just speculation, I don't have measurements. I was
> just wondering whether on today's architectures it isn't better to burn
> more CPU cycles in a GC that tries very hard to touch as little data in
> memory as possible.
>
> I'm curious if any of the implementors present here have anything to say
> about this?
>

Check out
http://www.cs.berkeley.edu/~fateman/papers/cachelisp.pdf

Yes, you can and probably should do things to reduce cache misses.
The positive side is that recent chips tend to have rather large
caches (L2 cache, anyway), and many programs will mostly fit there.
Of course you might find that much of it is thrashed periodically
by your gluttonous operating system, but that's life. Unfortunately
the L1 cache is generally too small to store enough to matter.

Also there are absolutely great tools (PAPI) for trying things out.
I used Allegro CL but other FFI-capable Lisps should be able to run
the cache statistics registers on Pentia, Sparc, etc.

RJF

Luke Gorrie

Dec 21, 2003, 11:33:54 PM
Hi Joe,

Thanks for the references.

> If the page really needs to be zeroed, it'd probably be better to do
> so in the process rather than in the kernel.

One advantage of doing it in the kernel is that it happens lazily
rather than in the GC, so GCs don't "stop the world" for as long.

The performance difference doesn't seem very significant on CMUCL with
Linux/x86. People interested in playing around can use this snippet to
make the GC zero out memory with the CPU:

(alien:def-alien-variable ("gencgc_unmap_zero" gencgc-unmap-zero) c-call::int)
(setq gencgc-unmap-zero 0)

> Sometimes you need to optimize memory management. But it is far
> better to check the performance first.

And to check it carefully. I'm amazed that after getting into
"micro-optimize" mode - disassembling the kernel, thinking about
protection faults, etc - everything started to feel so expensive that
I swallowed benchmark results that were off by at least two orders of
magnitude. It is all too easy to disengage one's brain in such
situations, at least for me.

The results I posted (and have tried to retract) drastically
exaggerate the cost of zeroing memory. I had, uh, somehow broken my
CMUCL such that calling CONS would allocate several kilobytes of
memory, thus triggering very frequent collections (and
re-initializations) of mostly-unused memory.

The reality seems to be that zero'ing and GC'ing is quite cheap
compared with actually doing something (e.g. poking CARs and CDRs
into) all of that memory. Emphasis on "seems" of course. :-)

Cheers,
Luke

Rob Warnock

Dec 22, 2003, 6:21:17 AM
Joe Marshall <prunes...@comcast.net> wrote:
+---------------
| Some experiments have shown that checking the write-barrier in
| software by doing a conditional branch on every write can outperform
| hardware checking based on mprotect.
+---------------

I remember reading some papers about that a while back. One in particular,
<URL:http://citeseer.nj.nec.com/hosking92comparative.html>, examined
various kinds of write barriers and concluded that a software write
barrier using card marking (with fairly small cards, 256-512 bytes)
out-performed kernel/VM page-based hardware write barriers:

The page trapping scheme performed poorly in comparison to
card marking. Interestingly, this does not appear to be due to
the overhead of fielding page traps, since that is included in
running time, which was not significantly higher (and often lower)
than in the card marking schemes. Rather, it is because pages
are too large a granule so they miss the optimum card size.

That is, a single write means that the whole page gets "dirtied"; you have
to scan the entire page at the next GC. Whereas with a card-marking system
you only have to scan the (presumably much smaller) card that was dirtied.

We can summarize the conclusions as follows: a card size of 256
or 512 bytes appears optimal for card marking on this hardware;
page trapping was surprisingly effective, but is not the best
scheme because its granularity is too large; and remembered sets[1]
are best overall.


-Rob

[1] "Remembered sets" are yet another way of doing software write barriers,
providing an even more precise record of the significant stores:

...associates a remembered set with each generation, recording
the objects or locations in older generations that may contain
pointers into that generation. Any pointer store that creates
a reference from an older generation to a younger generation
is recorded in the remembered set for the younger generation.
At scavenge time the remembered sets for the generations being
scavenged include the heap root set for the scavenge.
...
Our remembered sets are implemented as circular hash tables
using linear hashing. ...
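
(A toy sketch of the idea in Lisp, not any real collector's code: each
object carries a generation number, and a barriered store records old
objects that are made to point at younger ones, so a later scavenge of the
young generation can treat them as extra roots.)

(defstruct gc-object
  (generation 0 :type fixnum)
  (slots (make-array 4) :type simple-vector))

(defparameter *remembered-set* (make-hash-table :test #'eq))

(defun barriered-store (object index value)
  ;; the ordinary store...
  (setf (svref (gc-object-slots object) index) value)
  ;; ...plus the write barrier: remember old->young references
  (when (and (gc-object-p value)
             (> (gc-object-generation object)
                (gc-object-generation value)))
    (setf (gethash object *remembered-set*) t))
  value)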

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Rob Warnock

Dec 22, 2003, 6:35:46 AM
Duane Rettig <du...@franz.com> wrote:
+---------------

| On the third hand, I have often raised some ruckus in comp.arch
| to ask them there for fast user-level trap handling as the
| best way for general-purpose computer architectures to support
| garbage-collected languages. If such an architecture (plus
| supporting kernel) were to ever be available in a general
| purpose operating system, then even if such a conditional
| trap were to take 30-50 cycles, it would still be worthwhile
| to pursue it to replace the 5-10 cycle software barrier, since
| software barriers tend to never be zero-cost, but traps, when
| not taken, do tend to be zero cost, so the challenge is to find
| the best algorithm that minimizes the number of times a trap is
| actually taken.
+---------------

But notice that the Hosking, et al, paper I cited in a parallel reply
made the point that it's not necessarily the cost of the trap that's
the killer, but the large size of a hardware page [4 KB or larger]
compared to the optimum card size [256-512 bytes]. With a page-trapping
system, at GC time you have to scan *all* of each page that got stored
into, even if only one word got stored per page.

Unfortunately, as memory sizes grow, pressure on TLB sizes will only
push hardware page sizes higher & higher. As I noted in the other
reply, some (non-PC) servers already default to 16 KB pages, and in
some applications use *much* larger hardware page sizes -- *megabytes*,
even!! -- to avoid TLB thrashing.[1] In those situations, software write
barriers once again become a big win.


-Rob

[1] SGI's Origin systems, running Irix on MIPS, support up to *16 MB*
pages for large HPC apps. This is aided enormously by the fact that
the MIPS TLB supports multiple page sizes in a single process, so
that not all shared-library pages have to be huge just because the
app is using large pages in some other part of its address space.

Luke Gorrie

Dec 22, 2003, 9:18:04 AM
rp...@rpw3.org (Rob Warnock) writes:

> But notice that the Hosking, et al, paper I cited in a parallel reply
> made the point that it's not necessarily the cost of the trap that's
> the killer, but the large size of a hardware page [4 KB or larger]
> compared to the optimum card size [256-512 bytes]. With a page-trapping
> system, at GC time you have to scan *all* of each page that got stored
> into, even if only one word got stored per page.

What seems most important to me is: how do I know when it matters?

Really, I already know that for myself it's so rarely that it's not
even funny. Perhaps once every second year I write a program where +/-
10% performance would be interesting. Even when I do, the "inner-loop"
is usually in the operating system somewhere.

What I'd like is to compile a "Read Before Optimising" sanity
check-list, just to put things in perspective when
premature-optimisation fever sets in. I'm hereby soliciting pointed
slogans :-)

Here are a few ideas to start with:

Worried about taking write-barrier interrupts when you modify old
variables?
Remember, your Linux box is already context-switching on
timer-interrupts one thousand times per second. Why do you think it
won't be lost in the noise?

Worried about dirtying some pages that will need to be scavenged?
Remember, your MP3 player is decompressing and then shipping
200KB/second to your sound card. When's the last time you turned
that off to make your computer run faster?

Worried that consing will slow you down?
Remember, your generational garbage collector was written specially
so that you could create as much garbage as you like. Exactly how
much speed are you expecting to gain, and how do you plan to measure
it?

Any takers? I think a nicely cl-typeset quick-reference card could be
just the thing for guys like me.

I'd be interested to read a complementary card written by people who
really do write every-cycle-counts programs all day long too. Is "turn
off your MP3 player" the first law of serious optimization? ;-)

Cheers,
Luke

Raymond Toy

Dec 22, 2003, 9:46:20 AM
>>>>> "Rob" == Rob Warnock <rp...@rpw3.org> writes:

Rob> Duane Rettig <du...@franz.com> wrote:
Rob> +---------------
Rob> | On the third hand, I have often raised some ruckus in comp.arch
Rob> | to ask them there for fast user-level trap handling as the
Rob> | best way for general-purpose computer architectures to support
Rob> | garbage-collected languages. If such an architecture (plus
Rob> | supporting kernel) were to ever be available in a general
Rob> | purpose operating system, then even if such a conditional
Rob> | trap were to take 30-50 cycles, it would still be worthwhile
Rob> | to pursue it to replace the 5-10 cycle software barrier, since
Rob> | software barriers tend to never be zero-cost, but traps, when
Rob> | not taken, do tend to be zero cost, so the challenge is to find
Rob> | the best algorithm that minimizes the number of times a trap is
Rob> | actually taken.
Rob> +---------------

Rob> But notice that the Hosking, et al, paper I cited in a parallel reply
Rob> made the point that it's not necessarily the cost of the trap that's
Rob> the killer, but the large size of a hardware page [4 KB or larger]
Rob> compared to the optimum card size [256-512 bytes]. With a page-trapping
Rob> system, at GC time you have to scan *all* of each page that got stored
Rob> into, even if only one word got stored per page.

Just a side note on an experiment I just did with CMUCL. Generational
GC was recently ported to cmucl for the Sparc architecture. I ran
Eric Marsden's CL benchmark suite on both the non-gencgc version and
the gencgc version. Most benchmarks were roughly equal, but a few
showed that gencgc was 3 times slower.

As an experiment, I changed the page size used by the gencgc
algorithm. It was 8K, but I changed it to 64K. According to the
benchmarks, gencgc was much improved, and only a little (25%?) slower
than the non-gencgc version.

Note, however, that this gencgc implementation works by doing a kernel
trap every once in a while to allocate memory when the inline
allocator runs out of space. By changing the page size from 8K to
64K, the kernel trap would get called much less often. Also, the
inline allocator in the gencgc is 2-3 times more expensive than the
non-gencgc allocator, which takes about 2-3 instructions to allocate
memory.

Anyway, just a note on some simple experiments I did.

Ray

Raymond Toy

Dec 22, 2003, 9:48:35 AM
>>>>> "Richard" == Richard Fateman <rfat...@sbcglobal.net> writes:

Richard> Also there are absolutely great tools (PAPI) for trying things out.
Richard> I used Allegro CL but other FFI-capable Lisps should be able to run
Richard> the cache statistics registers on Pentia, Sparc, etc.

Eric Marsden's CL benchmark suite has a set of routines for accessing
these registers for CMUCL. The last time I ran it with his benchmark
routines, I don't think cache misses were much of a problem for this
benchmark. FWIW. YMMV.

Ray

Marcin 'Qrczak' Kowalczyk

Dec 22, 2003, 10:12:52 AM
On Sun, 21 Dec 2003 16:41:44 -0500, Rahul Jain wrote:

> Also, they often move the data by following pointers. This means that
> following a chain of references is likely to involve accessing data that
> is spatially near.

The Cheney algorithm performs a breadth-first traversal of live objects,
so it actually puts referenced data far from the references. Is there
a reasonably simple modification which does this better?

Depth-first would be good but it would require to maintain a separate
stack of objects to be processed. Since object chains can be long, it
can't even be the native stack. But maybe this would be the answer?

I've just read <http://research.microsoft.com/~trishulc/papers/ismm_paper.pdf>
but realtime profiling is a bit too intrusive and complicated for my taste.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk

Dec 22, 2003, 10:19:54 AM
On Mon, 22 Dec 2003 05:21:17 -0600, Rob Warnock wrote:

> ...associates a remembered set with each generation, recording
> the objects or locations in older generations that may contain
> pointers into that generation.

Is it better to store objects or locations? Neither is *obviously* better:
storing objects avoids storing the same object multiple times when it's
written to sequentially (if we avoid storing the same pointer twice in a
row), OTOH storing locations needs to scan fewer objects during GC. But
maybe one or the other has been measured to be usually better?

And how many previous stores is it worth to compare? Qish even compares
three recent addresses (and it stores objects, not locations).

Joe Marshall

Dec 22, 2003, 11:13:08 AM
Marcin 'Qrczak' Kowalczyk <qrc...@knm.org.pl> writes:

> On Sun, 21 Dec 2003 16:41:44 -0500, Rahul Jain wrote:
>
>> Also, they often move the data by following pointers. This means that
>> following a chain of references is likely to involve accessing data that
>> is spatially near.
>
> The Cheney algorithm performs a breadth-first traversal of live objects,
> so it actually puts referenced data far from the references. Is there
> a reasonably simple modification which does this better?

Use a stack.

> Depth-first would be good but it would require to maintain a separate
> stack of objects to be processed. Since object chains can be long, it
> can't even be the native stack. But maybe this would be the answer?

There are a couple of ways to deal with this. One is to use a bounded
stack and to fall back on breadth-first when the stack overflows. But
these days, it's pretty easy to just make the stack big enough.


--
~jrm

Richard Fateman

Dec 22, 2003, 1:33:38 PM
I had not seen Eric Marsden's cache counting, but looking at it,
it seems to be specific to CMU-CL and Sparc v9. PAPI is available
for users on linux or windows on Intel, as well as Sparc, Alpha,
PowerPC. Presumably there is no inherent dependence on CMU-CL in
the Sun performance monitoring routines!


In my experiments you have to exceed the L2 cache size to see things
slow down. You generally can't fit enough in the L1 cache to see a
slowdown between L1 and L2 sizes. My guess is that for a lot of
small to medium-large lisp calculations the built-in cache behavior
is quite acceptable, which probably includes most benchmarks that
predate the era of gigabyte RAM. Of course there are a couple more
cliffs to fall off: exceeding L2 cache, exceeding L3 cache if there
is such a thing (maybe out at 2.5-3megabytes?) exceeding RAM, when
you fall into the hole of running paging to disk.

RJF

William D Clinger

Dec 22, 2003, 4:46:15 PM
Concerning the write barrier, Marcin 'Qrczak' Kowalczyk asked:

> Is it better to store objects or locations? Neither is *obviously* better:
> storing objects avoids storing the same object multiple times when it's
> written to sequentially (if we avoid storing the same pointer twice in a
> row), OTOH storing locations needs to scan fewer objects during GC.

With a card table barrier, storing objects is likely to take two machine
instructions; storing locations is likely to take three. It would cost
a lot more to avoid storing the same object or location twice in a row,
but one of the advantages of card tables is that storing the same object
or location twice makes no difference.

> But maybe one or the other has been measured to be usually better?

There are too many interacting variables here for anyone to draw general
conclusions on which is better. This issue interacts with the barrier
technology, the gc technology, the application, and the hardware.

Besides, the two answers you consider are not mutually exclusive. Some
assignments can use a write barrier that stores objects while others use
a barrier that stores locations or approximate locations (e.g. cards).

Will

Larry Clapp

Dec 24, 2003, 5:40:16 PM
In article <lh1xqxx...@dodo.bluetail.com>, Luke Gorrie wrote:
> What I'd like is to compile a "Read Before Optimising" sanity
> check-list, just to put things in perspective when
> premature-optimisation fever sets in. I'm hereby soliciting pointed
> slogans :-)

Though for Perl, I've found this one fairly helpful:

http://magnonel.guild.net/~schwern/talks/How_To_Be_Lazy/full_slides/rules_of_optimization.html

My favorite rules of optimization (excerpted from above):
Rule #1: Don't do it.
Rule #2: Don't do it yet.

--
Larry Clapp / la...@theclapp.org

Joe Marshall

Dec 24, 2003, 5:49:20 PM
Larry Clapp <la...@theclapp.org> writes:

Rule #3: Whatever you do, don't do it in Perl.


--
~jrm

Rob Warnock

Dec 25, 2003, 2:08:05 AM
Richard Fateman <rfat...@sbcglobal.net> wrote:
+---------------

| In my experiments you have to exceed the L2 cache size to see things
| slow down.
+---------------

Ungar's original paper[1] proposing a fixed-location nursery for
generational GCs made exactly this point: For best performance
the size of the nursery should be somewhere around the size of
the secondary cache.


-Rob

[1] I don't have the reference at hand at the moment, but could dig
it out of my hard-copy file of GC papers when I get back home
next week.

Raymond Toy

Dec 29, 2003, 10:12:22 AM
>>>>> "Richard" == Richard Fateman <rfat...@sbcglobal.net> writes:

Richard> I had not seen Eric Marsden's cache counting, but looking at it,
Richard> it seems to be specific to CMU-CL and Sparc v9. PAPI is available
Richard> for users on linux or windows on Intel, as well as Sparc, Alpha,
Richard> PowerPC. Presumably there is no inherent dependence on CMU-CL in
Richard> the Sun performance monitoring routines!

No, there's not. It just needs an FFI to the Sun routines, and I
guess Eric only made them work with CMUCL. But Sparc v9 is a
requirement, I think, because that's when the performance monitoring
registers were made available.

Richard> In my experiments you have to exceed the L2 cache size to see things
Richard> slow down. You generally can't fit enough in the L1 cache to see a
Richard> slowdown between L1 and L2 sizes. My guess is that for a lot of
Richard> small to medium-large lisp calculations the built-in cache behavior
Richard> is quite acceptable, which probably includes most benchmarks that
Richard> predate the era of gigabyte RAM. Of course there are a couple more

I think the benchmarks themselves are fairly small in code space.
Some of the benchmarks do, however, cons quite a bit. Say a few
hundred megabytes for some tests.

Ray

Eric Marsden

Dec 29, 2003, 3:10:40 PM
>>>>> "rt" == Raymond Toy <t...@rtp.ericsson.se> writes:
>>>>> "Richard" == Richard Fateman <rfat...@sbcglobal.net> writes:

Richard> I had not seen Eric Marsden's cache counting, but looking at it,
Richard> it seems to be specific to CMU-CL and Sparc v9. PAPI is available
Richard> for users on linux or windows on Intel, as well as Sparc, Alpha,
Richard> PowerPC. Presumably there is no inherent dependence on CMU-CL in
Richard> the Sun performance monitoring routines!

rt> No, there's not. It just needs an FFI to the Sun routines, and I
rt> guess Eric only made them work with CMUCL. But Sparc v9 is a
rt> requirement, I think, because that's when the performance monitoring
rt> registers were made available.

the Solaris CPU Performance Counters API also provides access to the
performance counters when running on x86 platforms. However, the types
of events that can be monitored differ from UltraSPARC, and I don't
have any Solaris/x86 machines, so haven't investigated this.

Accessing the x86 performance counters from Linux requires a kernel
patch that I haven't yet installed. I expect that comparing the
CMUCL/x86 and CMUCL/UltraSPARC results would be interesting, since
recent x86 processors do out-of-order execution, which could reduce
the impact of the relatively poor instruction scheduling of the code
generated by CMUCL.

--
Eric Marsden <URL:http://www.laas.fr/~emarsden/>

Eric Marsden

Dec 29, 2003, 3:11:32 PM
>>>>> "rf" == Richard Fateman <rfat...@sbcglobal.net> writes:

rf> In my experiments you have to exceed the L2 cache size to see things
rf> slow down. You generally can't fit enough in the L1 cache to see a
rf> slowdown between L1 and L2 sizes. My guess is that for a lot of
rf> small to medium-large lisp calculations the built-in cache behavior
rf> is quite acceptable, which probably includes most benchmarks that
rf> predate the era of gigabyte RAM.

I have run my performance benchmarks on 3 UltraSPARC machines: an el
cheapo Blade 100, a Sun-Fire-V240 and a Sun-Fire-280R. All machines
have enough RAM to run the tests without swapping.

- UltraSPARC-IIe @ 500MHz: 16kB icache, 16kB dcache, 256kB L2 cache
- UltraSPARC-IIIi @ 1GHz: 32kB icache, 65kB dcache, 1MB L2 cache
- UltraSPARC-III @ 700MHz: 32kB icache, 65kB dcache, 8MB L2 cache

These experiments suggest that:

- many of the benchmarks _do_ fit into the L1 instruction cache, but
for those with a larger code size (eg using the pretty-printer, use of
complex numbers, or the entire CMUCL compiler) there can be a 30%
miss rate.

- a larger L2 cache doesn't seem to help greatly over 1MB, except
for benchmarks that are designed to thrash the cache. (It sure
does increase the price of the processor, though :-)

- the quality of instruction scheduling seems to be more important
on UltraSPARC than considerations related to cache: on average
around 25% of cycles are lost to load stalls.


I used the processor's performance counters to measure metrics such as

- cycles per instruction: these processors are 4-way superscalar so
have a best-case CPI of 0.25 with a perfect mix of floating
point and integer operations. Given that few of the benchmarks
use floating point, ideal CPI is 0.5. The best attained by CMUCL
is 0.8.

- rate of instruction-cache misses, and rate of stalls due to
icache misses

- rate of L2 cache misses

- rate of load-use stalls and stalls due to mispredicted branches

Unfortunately, in some cases the relevant events differ between US-II
and US-III. It is also possible (though more difficult) to obtain
other measures such as dcache miss/stall rates, but it requires
multiple runs observing different events.


I have appended some raw results. Most of these benchmarks are
included in the cl-bench suite, but I've added a few new tests:

- COMPILER involves running the CMUCL compiler on a source file

- WALK-LIST/SEQ and WALK-LIST/MESS are inspired by the code in
Fateman's cachelisp paper; both walk a list of 2M fixnums,
sequentially allocated in the former case and merge-sorted in the
latter case (to spread pointers all over memory)
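
(Roughly this kind of thing, as a sketch; the actual cl-bench code may
differ:)

(defparameter *list-size* (* 2 1024 1024))

(defun make-seq-list ()
  ;; cells consed one after another, so successive CDRs stay mostly adjacent
  (loop for i below *list-size* collect i))

(defun make-messy-list ()
  ;; sorting random fixnums relinks the cells in an order unrelated to
  ;; their addresses, spreading successive CDRs across the heap
  (sort (loop repeat *list-size* collect (random most-positive-fixnum)) #'<))

(defun walk-list (list)
  (declare (optimize (speed 3)))
  ;; just touch every cell
  (loop for tail on list count t))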

The columns are CPI (cycles per instruction), icache miss rate, icache stall rate, ecache
miss rate, load-use stall rate. All the rates are percentages of
total number of cycles in user space, for an execution of between 1
and 3 seconds, with full gc between each test.

=== Dual UltraSPARC III @ 700 MHz, 32kB icache, 65kB dcache, 8MB ecache ===

;; COMPILER 2.09 [i: 17.2 34.8 ] [e: 1.1] 6.4
;; LOAD-FASL 1.51 [i: 8.6 20.3 ] [e: 0.8] 11.0
;; PERMUTATIONS 1.44 [i: 7.5 1.3 ] [e: 1.0] 6.8
;; WALK-LIST/SEQ 2.08 [i: 0.4 0.1 ] [e: 9.9] 4.3
;; WALK-LIST/MESS 6.69 [i: 2.2 0.0 ] [e: 27.3] 1.8
;; BOYER 1.44 [i: 10.2 1.1 ] [e: 0.3] 8.5
;; BROWSE 1.48 [i: 6.6 3.5 ] [e: 0.6] 7.5
;; DDERIV 2.42 [i: 9.0 3.2 ] [e: 2.2] 6.0
;; DERIV 2.43 [i: 9.8 3.1 ] [e: 2.2] 6.2
;; DESTRUCTIVE 1.47 [i: 12.1 1.3 ] [e: 0.6] 5.5
;; DIV2-TEST-1 3.40 [i: 2.0 2.5 ] [e: 2.3] 5.2
;; DIV2-TEST-2 2.41 [i: 7.9 2.0 ] [e: 1.2] 5.2
;; FFT 1.26 [i: 2.2 2.0 ] [e: 0.0] 6.2
;; FRPOLY/FIXNUM 1.74 [i: 14.5 1.7 ] [e: 0.4] 9.1
;; FRPOLY/BIGNUM 1.35 [i: 6.9 3.6 ] [e: 0.4] 10.4
;; FRPOLY/FLOAT 1.59 [i: 10.0 1.6 ] [e: 0.3] 13.1
;; PUZZLE 0.94 [i: 1.7 1.7 ] [e: 0.0] 20.5
;; TAK 1.00 [i: 10.7 0.5 ] [e: 0.0] 9.8
;; CTAK 1.29 [i: 0.1 0.5 ] [e: 0.0] 8.1
;; TRTAK 1.04 [i: 0.1 12.8 ] [e: 0.0] 3.6
;; TAKL 1.12 [i: 4.3 0.1 ] [e: 0.0] 3.0
;; STAK 1.25 [i: 5.6 0.2 ] [e: 0.0] 19.9
;; FPRINT/UGLY 1.04 [i: 4.7 3.6 ] [e: 0.0] 15.2
;; FPRINT/PRETTY 1.47 [i: 15.1 29.6 ] [e: 0.1] 10.3
;; TRAVERSE 1.44 [i: 5.6 0.2 ] [e: 0.2] 7.3
;; TRIANGLE 0.96 [i: 4.5 0.8 ] [e: 0.0] 18.2
;; RICHARDS 1.43 [i: 15.3 0.9 ] [e: 0.0] 14.1
;; FACTORIAL 1.91 [i: 6.2 16.0 ] [e: 1.3] 12.2
;; FIB 1.57 [i: 22.7 0.6 ] [e: 0.0] 2.7
;; FIB-RATIO 0.95 [i: 3.6 1.4 ] [e: 0.1] 9.0
;; ACKERMANN 1.73 [i: 9.5 0.3 ] [e: 0.1] 4.4
;; MANDELBROT 1.41 [i: 6.5 13.8 ] [e: 0.6] 7.4
;; MRG32K3A 0.92 [i: 0.0 0.0 ] [e: 0.0] 13.1
;; CRC40 1.08 [i: 6.1 1.6 ] [e: 0.7] 9.0
;; BIGNUM/ELEM-100-1000 0.96 [i: 3.4 1.6 ] [e: 0.1] 9.8
;; BIGNUM/ELEM-1000-100 0.89 [i: 0.8 0.3 ] [e: 0.0] 11.4
;; BIGNUM/ELEM-10000-1 0.95 [i: 0.1 0.1 ] [e: 0.0] 14.1
;; BIGNUM/PARI-100-10 0.84 [i: 2.4 0.6 ] [e: 0.0] 10.4
;; BIGNUM/PARI-200-5 0.81 [i: 1.0 0.3 ] [e: 0.0] 11.4
;; PI-DECIMAL/SMALL 0.87 [i: 1.9 1.1 ] [e: 0.1] 10.7
;; PI-DECIMAL/BIG 0.85 [i: 1.2 0.6 ] [e: 0.1] 11.5
;; PI-ATAN 1.55 [i: 3.6 6.0 ] [e: 1.2] 8.7
;; PI-RATIOS 0.93 [i: 3.4 2.8 ] [e: 0.1] 9.4
;; SLURP-LINES 1.54 [i: 8.6 23.4 ] [e: 0.7] 10.5
;; HASH-STRINGS 1.01 [i: 5.0 0.7 ] [e: 0.2] 12.6
;; HASH-INTEGERS 1.14 [i: 4.5 1.1 ] [e: 0.5] 10.6
;; BOEHM-GC 1.88 [i: 10.9 2.4 ] [e: 2.8] 4.6
;; DEFLATE-FILE 1.00 [i: 2.9 2.4 ] [e: 0.1] 14.7
;; 1D-ARRAYS 0.95 [i: 5.5 0.9 ] [e: 0.1] 18.3
;; 2D-ARRAYS 1.07 [i: 2.5 0.1 ] [e: 0.4] 22.2
;; 3D-ARRAYS 0.89 [i: 3.5 0.2 ] [e: 0.2] 22.7
;; BITVECTORS 1.27 [i: 0.1 0.5 ] [e: 1.5] 8.2
;; BENCH-STRINGS 1.26 [i: 13.4 0.3 ] [e: 0.1] 10.9
;; fill-strings/adjustable 0.90 [i: 6.3 0.3 ] [e: 0.0] 16.8
;; STRING-CONCAT 1.05 [i: 6.0 0.6 ] [e: 0.2] 10.7
;; SEARCH-SEQUENCE 1.18 [i: 8.2 1.5 ] [e: 0.3] 13.1

=== Dual UltraSPARC IIIi @ 1GHz, 32kB icache, 65kB dcache, 1MB ecache ===

;; COMPILER 2.14 [i: 14.5 38.8 ] [e: 1.7] 5.8
;; LOAD-FASL 1.55 [i: 7.6 18.4 ] [e: 0.8] 11.8
;; PERMUTATIONS 1.30 [i: 7.2 1.2 ] [e: 1.0] 6.9
;; WALK-LIST/SEQ 2.24 [i: 0.0 0.0 ] [e: 12.5] 3.8
;; WALK-LIST/MESS 11.77 [i: 2.1 0.0 ] [e: 67.4] 0.8
;; BOYER 1.43 [i: 9.5 0.9 ] [e: 0.3] 8.5
;; BROWSE 1.53 [i: 4.5 2.8 ] [e: 0.6] 8.0
;; DDERIV 2.62 [i: 7.4 2.3 ] [e: 2.2] 5.4
;; DERIV 2.68 [i: 8.5 2.5 ] [e: 2.2] 5.7
;; DESTRUCTIVE 1.51 [i: 12.9 0.9 ] [e: 0.6] 5.1
;; DIV2-TEST-1 3.60 [i: 1.9 2.0 ] [e: 2.3] 4.8
;; DIV2-TEST-2 2.47 [i: 7.4 1.6 ] [e: 1.2] 5.0
;; FFT 1.22 [i: 1.2 0.7 ] [e: 0.0] 6.4
;; FRPOLY/FIXNUM 1.72 [i: 13.0 1.5 ] [e: 0.4] 8.7
;; FRPOLY/BIGNUM 1.38 [i: 6.7 3.3 ] [e: 0.4] 10.1
;; FRPOLY/FLOAT 1.59 [i: 9.2 1.3 ] [e: 0.3] 13.1
;; PUZZLE 1.01 [i: 2.2 1.6 ] [e: 0.0] 20.4
;; TAK 0.98 [i: 6.0 0.4 ] [e: 0.0] 3.5
;; CTAK 1.24 [i: 7.6 0.5 ] [e: 0.0] 8.1
;; TRTAK 0.99 [i: 7.8 2.6 ] [e: 0.0] 5.6
;; TAKL 1.07 [i: 3.8 0.1 ] [e: 0.0] 4.3
;; STAK 1.37 [i: 4.8 0.1 ] [e: 0.0] 18.7
;; FPRINT/UGLY 1.03 [i: 5.2 2.7 ] [e: 0.0] 15.4
;; FPRINT/PRETTY 1.37 [i: 11.5 26.1 ] [e: 0.1] 10.8
;; TRAVERSE 1.38 [i: 5.3 0.1 ] [e: 0.0] 7.2
;; TRIANGLE 0.98 [i: 5.2 0.0 ] [e: 0.0] 17.9
;; RICHARDS 1.51 [i: 13.0 0.7 ] [e: 0.0] 14.9
;; FACTORIAL 1.85 [i: 8.3 12.2 ] [e: 1.3] 13.2
;; FIB 1.59 [i: 18.6 0.5 ] [e: 0.0] 2.5
;; FIB-RATIO 0.97 [i: 5.6 1.2 ] [e: 0.2] 9.1
;; ACKERMANN 1.73 [i: 17.8 0.9 ] [e: 0.0] 1.4
;; MANDELBROT 1.37 [i: 9.3 11.4 ] [e: 0.6] 6.7
;; MRG32K3A 0.90 [i: 0.0 0.0 ] [e: 0.0] 13.3
;; CRC40 1.08 [i: 5.7 1.3 ] [e: 0.7] 9.1
;; BIGNUM/ELEM-100-1000 0.97 [i: 3.5 1.4 ] [e: 0.2] 9.7
;; BIGNUM/ELEM-1000-100 0.88 [i: 0.8 0.3 ] [e: 0.0] 11.4
;; BIGNUM/ELEM-10000-1 0.93 [i: 0.1 0.0 ] [e: 0.0] 13.6
;; BIGNUM/PARI-100-10 0.84 [i: 2.6 0.5 ] [e: 0.0] 10.3
;; BIGNUM/PARI-200-5 0.81 [i: 1.1 0.2 ] [e: 0.0] 11.3
;; PI-DECIMAL/SMALL 0.87 [i: 2.4 1.0 ] [e: 0.1] 10.6
;; PI-DECIMAL/BIG 0.85 [i: 1.2 0.4 ] [e: 0.0] 11.7
;; PI-ATAN 1.51 [i: 3.8 5.1 ] [e: 1.1] 8.8
;; PI-RATIOS 0.94 [i: 4.4 2.4 ] [e: 0.1] 9.1
;; SLURP-LINES 1.37 [i: 7.7 14.4 ] [e: 0.2] 11.8
;; HASH-STRINGS 1.00 [i: 5.1 0.8 ] [e: 0.2] 12.1
;; HASH-INTEGERS 1.19 [i: 1.8 0.9 ] [e: 0.8] 9.8
;; BOEHM-GC 1.89 [i: 9.5 1.8 ] [e: 2.8] 4.7
;; DEFLATE-FILE 1.02 [i: 5.1 1.6 ] [e: 0.1] 14.7
;; 1D-ARRAYS 0.96 [i: 5.7 0.7 ] [e: 0.0] 19.0
;; 2D-ARRAYS 1.10 [i: 2.6 0.0 ] [e: 0.5] 20.9
;; 3D-ARRAYS 1.02 [i: 3.1 0.0 ] [e: 0.3] 19.5
;; BITVECTORS 1.46 [i: 0.1 0.5 ] [e: 1.6] 7.1
;; BENCH-STRINGS 1.29 [i: 11.9 0.0 ] [e: 0.2] 10.8
;; fill-strings/adjustable 0.90 [i: 6.0 0.6 ] [e: 0.0] 16.9
;; STRING-CONCAT 0.93 [i: 6.2 0.4 ] [e: 0.2] 11.5
;; SEARCH-SEQUENCE 1.17 [i: 5.9 1.2 ] [e: 0.3] 13.7


=== UltraSPARC-IIe @ 500 MHz, 16kB icache, 16kB dcache, 256kB L2 cache ===

;; COMPILER 2.79 [i: 15.6 37.4 ] [e: 59.2] 21.1
;; LOAD-FASL 1.89 [i: 9.3 24.0 ] [e: 46.2] 23.8
;; PERMUTATIONS 1.19 [i: 0.7 1.8 ] [e: 19.0] 25.2
;; WALK-LIST/SEQ 1.65 [i: 0.0 0.1 ] [e: 57.2] 69.6
;; WALK-LIST/MESS 6.18 [i: 0.0 2.5 ] [e: 91.3] 84.7
;; BOYER 1.49 [i: 3.1 6.6 ] [e: 36.1] 22.0
;; BROWSE 1.34 [i: 2.2 5.7 ] [e: 30.5] 25.6
;; DDERIV 1.76 [i: 1.5 4.0 ] [e: 31.0] 30.8
;; DERIV 1.79 [i: 1.5 3.8 ] [e: 25.9] 24.2
;; DESTRUCTIVE 1.19 [i: 0.3 1.1 ] [e: 17.2] 11.9
;; DIV2-TEST-1 1.98 [i: 1.6 4.1 ] [e: 40.8] 55.9
;; DIV2-TEST-2 1.60 [i: 0.9 2.9 ] [e: 28.1] 43.1
;; FFT 1.12 [i: 0.1 0.3 ] [e: 9.0] 14.1
;; FRPOLY/FIXNUM 1.56 [i: 0.9 2.4 ] [e: 22.0] 18.1
;; FRPOLY/BIGNUM 1.55 [i: 6.6 12.4 ] [e: 43.2] 15.8
;; FRPOLY/FLOAT 1.40 [i: 1.8 4.0 ] [e: 30.8] 23.2
;; PUZZLE 0.96 [i: 0.4 1.8 ] [e: 22.5] 23.2
;; TAK 1.17 [i: 0.1 0.5 ] [e: 1.4] 20.5
;; CTAK 1.19 [i: 0.2 0.7 ] [e: 1.1] 18.7
;; TRTAK 1.17 [i: 0.1 0.4 ] [e: 1.6] 18.4
;; TAKL 0.79 [i: 0.0 0.1 ] [e: 0.9] 2.7
;; STAK 1.10 [i: 0.1 0.2 ] [e: 0.2] 20.9
;; FPRINT/UGLY 1.54 [i: 10.2 18.4 ] [e: 59.8] 15.8
;; FPRINT/PRETTY 1.97 [i: 20.4 30.9 ] [e: 65.3] 13.2
;; TRAVERSE 1.69 [i: 0.0 0.2 ] [e: 19.1] 52.4
;; TRIANGLE 0.84 [i: 0.1 0.3 ] [e: 0.9] 16.9
;; RICHARDS 1.55 [i: 0.0 0.0 ] [e: 1.0] 19.3
;; FACTORIAL 2.07 [i: 10.4 15.4 ] [e: 54.4] 9.6
;; FIB 1.32 [i: 0.0 0.1 ] [e: 0.3] 0.9
;; FIB-RATIO 1.02 [i: 1.0 3.3 ] [e: 12.4] 11.7
;; ACKERMANN 1.38 [i: 0.0 0.0 ] [e: 24.5] 15.2
;; MANDELBROT 1.66 [i: 12.1 24.6 ] [e: 54.3] 12.5
;; MRG32K3A 0.87 [i: 0.0 0.0 ] [e: 0.1] 23.3
;; CRC40 1.10 [i: 2.3 5.6 ] [e: 25.0] 13.6
;; BIGNUM/ELEM-100-1000 0.95 [i: 0.6 2.2 ] [e: 9.3] 16.7
;; BIGNUM/ELEM-1000-100 0.87 [i: 0.1 0.3 ] [e: 2.8] 25.8
;; BIGNUM/ELEM-10000-1 1.02 [i: 0.0 0.1 ] [e: 10.3] 27.0
;; BIGNUM/PARI-100-10 0.82 [i: 0.1 0.5 ] [e: 2.0] 20.1
;; BIGNUM/PARI-200-5 0.76 [i: 0.0 0.2 ] [e: 1.1] 26.2
;; PI-DECIMAL/SMALL 0.84 [i: 0.2 1.1 ] [e: 3.9] 20.6
;; PI-DECIMAL/BIG 0.81 [i: 0.1 0.6 ] [e: 2.5] 24.4
;; PI-ATAN 1.52 [i: 3.2 6.9 ] [e: 27.1] 15.2
;; PI-RATIOS 0.95 [i: 0.8 3.1 ] [e: 12.1] 16.1
;; SLURP-LINES 1.47 [i: 10.5 20.7 ] [e: 54.5] 12.0
;; HASH-STRINGS 1.07 [i: 1.2 3.2 ] [e: 29.3] 17.1
;; HASH-INTEGERS 1.27 [i: 1.0 1.6 ] [e: 23.3] 26.3
;; BOEHM-GC 1.48 [i: 0.8 2.9 ] [e: 24.6] 17.2
;; DEFLATE-FILE 0.97 [i: 0.2 1.9 ] [e: 10.3] 15.9
;; 1D-ARRAYS 0.97 [i: 0.0 0.1 ] [e: 2.5] 18.6
;; 2D-ARRAYS 1.12 [i: 0.0 0.1 ] [e: 6.4] 28.5
;; 3D-ARRAYS 0.82 [i: 0.0 0.1 ] [e: 1.1] 20.3
;; BITVECTORS 3.35 [i: 0.2 0.5 ] [e: 42.9] 84.1
;; BENCH-STRINGS 1.17 [i: 0.0 0.0 ] [e: 5.5] 11.3
;; fill-strings/adjustable 0.92 [i: 0.0 0.1 ] [e: 2.8] 17.0
;; STRING-CONCAT 1.19 [i: 4.1 8.1 ] [e: 40.1] 17.1
;; SEARCH-SEQUENCE 1.07 [i: 0.5 1.2 ] [e: 15.5] 19.7
