
allocator and GC locality (was Re: cost of malloc)


Paul Wilson

Jul 27, 1995
In article <950726164...@aruba.apple.com>,
Ken Dickey <Ke...@apple.com> wrote:
>At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
>...
>
>>Another thing to note is that compacting GC's usually dramatically
>>reduce the overall working set of programs, which in turn reduces the
>>number of page faults a program might take. ...

I don't think this is quite right. GC usually costs you a little bit of
locality, relative to a well-designed explicit allocator. There's very
little data on locality of allocators, however. (Zorn's done some basic
stuff, but I don't regard it as conclusive.) We hope to fix that sometime
soon.

GC costs space, because you can't reuse memory immediately--you have
to wait until the GC notices it's garbage. All other things being
equal, this is bad news. Generational GC helps, but does not eliminate
the problem, especially at the level of high-speed caches (as opposed
to virtual memory). If the youngest generation is larger than the
cache, the number of cache misses can go WAY up. (But this may or
may not cost you a lot of stalls.)
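
To make the arithmetic concrete, here's a rough sketch of the usual
bump-pointer nursery allocation (my toy code; the 4 MB nursery and the cache
size in the comment are made-up numbers for illustration):

  // Hypothetical numbers: a 4 MB nursery swept by a bump pointer vs. a
  // 256 KB data cache.  Between two minor GCs the mutator writes 16x the
  // cache capacity, so nearly every newly allocated line misses (and
  // evicts a dirty line) unless the cache has GC-friendly features.
  #include <cstddef>
  #include <cstdint>

  const std::size_t kNurseryBytes = 4 * 1024 * 1024;   // assumed nursery size
  static std::uint8_t nursery[kNurseryBytes];
  static std::size_t  bump = 0;

  void *gc_alloc(std::size_t n) {
      n = (n + 7) & ~std::size_t(7);          // double-word align
      if (bump + n > kNurseryBytes) {
          // minor_collection();              // would reset 'bump' here
          bump = 0;                           // placeholder for the sketch
      }
      void *p = &nursery[bump];               // always fresh, never-recycled
      bump += n;                              // memory: each cache line is
      return p;                               // touched for the first time
  }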

A poorly-designed GC (or malloc) can be a locality disaster, because
it can scramble related data. A naive copying collector may do this
when it hits a hash table---it reaches the structures referenced from
the hash table in pseudo-random order, and tends to interleave things
in memory as it traverses and copies them. A conventional next fit
allocator (i.e., first fit with a roving pointer) may do something
similar. Next fit is bad news. (See our PLDI '91 paper for demonstrations
of the effects of copying data in randomized order, and our allocator
survey paper for a discussion of next fit. The latter is on our ftp
site.)
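
For readers who haven't seen it, here's roughly what next fit looks like
(a toy sketch of mine over a circular free list, not anybody's production
allocator); the roving pointer is what tends to strew consecutive
allocations across the whole heap instead of packing them together:

  #include <cstddef>

  struct FreeBlock { std::size_t size; FreeBlock *next; };  // circular free list

  static FreeBlock *rover = nullptr;   // the roving pointer

  // Next fit: first fit, but each search resumes where the last one stopped.
  void *next_fit_alloc(std::size_t n) {
      if (!rover) return nullptr;
      FreeBlock *start = rover;
      do {
          if (rover->size >= n) {
              void *p = rover;         // (splitting and unlinking omitted)
              rover = rover->next;     // next request starts *past* this block,
              return p;                // so consecutive allocations get strewn
          }                            // around the heap rather than clustered
          rover = rover->next;
      } while (rover != start);
      return nullptr;                  // nothing big enough
  }

Over time the rover cycles through the entire free list, interleaving objects
from unrelated phases---exactly the scrambling described above.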

>>So even though time-to-market considerations are important (and I'll
>>bet a GC wins this race every time), I now believe that even your
>>basic overall performance wins in modern GC-based systems.
>
>Which brings to mind another paper: "Cache Performance of Garbage Collected
>Programs" by Mark Reinhold, L&FP '94. Mark presents evidence that even a
>stop & copy collector exhibits good cache performance when amortized over a
>program's execution lifetime and argues that complex compaction schemes
>are not required. Mark also conjectures that a gc'ed, mostly functional
>style of programming is significantly more efficient than imperative-style
>non-gc'ed programs in a cached VM system ("allocation is faster than
>mutation") and the performance gains will be more significant as CPU's
>continue to get faster relative to memory systems.

I'm quite skeptical of this. When you romp through lots of memory on
a regular basis (e.g., with a normal tracing GC, during allocation *between*
GC's), you miss your cache like crazy. On some architectures, this is
not in itself a big deal, because the cache effectively optimizes away
the fetching of garbage blocks from memory, and allocates blocks directly
in the cache. (With good write buffers and high cache-to-memory bandwidth,
the evictions may not be a big deal either.) On other architectures, it is
a big lose because every time you allocate another block of memory, you fault
in a block of garbage just before you overwrite it during allocation. This
is very dependent on the details of the cache architecture, and most current
processors (including most high-end processors) do NOT have the GC-friendly
features.

I'm especially skeptical of the scalability argument. As processor speeds
grow relative to memory *bandwidth*, the write-backs caused by profligate
heap allocation will become a bottleneck. Whenever you allocate a block,
you have to evict something, and in a GC'd system, the block you evict
is almost always dirty (because it was written to when you allocated something
there a little while ago). Prefetching and preflushing don't help with
this once you're bandwidth-limited. Reusing memory while it's still in
the cache is a better and better idea as processors get faster.

(Appel's new (FPCA '95) paper deals with the issue of interactions between
cache design and GC design. It pretty much vindicates my much-attacked
LFP '92 paper. As it turns out, everybody seems to be right---I'm right
for most architectures, but the "GC has good cache locality" camp (e.g.,
Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
you have a NICE cache.)

The bottom line, in my view, is that if you're targeting a range of
machines and can't pick your cache design, you should be very careful
about locality of reference during allocation. Allocate on the stack
when you can do so safely, not on the heap---especially for very fast
processors, where cache stalls start to be really important.
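
As a trivial illustration of that advice (my made-up example), the stack
version below reuses the same few cache lines on every call, while the heap
version keeps touching fresh, cache-cold memory until the collector or free
list gets around to recycling it:

  #include <vector>

  int sum_squares_heap(int n) {
      std::vector<int> tmp(n);             // heap: a fresh block, likely cold
      for (int i = 0; i < n; i++) tmp[i] = i * i;
      int s = 0;
      for (int i = 0; i < n; i++) s += tmp[i];
      return s;
  }

  int sum_squares_stack(int n) {
      int tmp[64];                         // stack: same addresses every call,
      if (n > 64) n = 64;                  // almost certainly still in cache
      for (int i = 0; i < n; i++) tmp[i] = i * i;
      int s = 0;
      for (int i = 0; i < n; i++) s += tmp[i];
      return s;
  }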

For more info on these things, see the papers on our ftp site.

Paul
--
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wil...@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and Scheme interpreters and compilers available via ftp from
| ftp.cs.utexas.edu, in pub/garbage, or http://www.cs.utexas.edu/users/oops/

Stefan Monnier

Jul 28, 1995
[comp.arch added since the thread is drifting towards cache design]

In article <3v8g7l$c...@jive.cs.utexas.edu>,
Paul Wilson <wil...@cs.utexas.edu> wrote:
] In article <950726164...@aruba.apple.com>,
] Ken Dickey <Ke...@apple.com> wrote:
] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
] >>Another thing to note is that compacting GC's usually dramatically
] >>reduce the overall working set of programs, which in turn reduces the
] >>number of page faults a program might take. ...

Is there hard data on this one?
The only case I know of where the GC really reduced the working set size (at
the VM level) is the TI Explorer. I don't have the reference at hand, but they
compared (on that TI machine) a plain generational copying collector to a
fancy incremental/generational/locality-improving GC on a CAD routing problem.
In most cases the fancy GC was a slight performance penalty, except when the
machine had little memory, in which case the paging behavior was noticeably
better with the fancy GC (it took about a day to finish, whereas they gave up
on the test with the plain GC after a week). Strangely, their conclusion was
that the plain GC was superior, since it's a lot simpler and in most cases
slightly faster. Of course, the incremental aspect of the GC wasn't seen as an
advantage for a big CAD routing problem (which does call for parameterizable
GC behavior, right?).

] I don't think this is quite right. GC usually costs you a little bit of
] locality, relative to a well-designed explicit allocator. There's very

I guess it depends on what you call locality. If you only consider the concept
(far from hardware problems), I can't see why: the set of objects used is the
same, and I can't see why the layout of objects has to be worse for a GC than
for a manual allocator. Of course, actual hardware generally behaves worse
with a GC because it doesn't quite understand the way a GC works. The better
the cache can "understand" the GC, the better the locality.

] little data on locality of allocators, however. (Zorn's done some basic
] stuff, but I don't regard it as conclusive.) We hope to fix that sometime
] soon.

I do hope so too, but it's not gonna be easy: there are so many variants of GC
and so many variants of memory hierarchies...

] (Appel's new (FPCA '95) paper deals with the issue of interactions between
] cache design and GC design. It pretty much vindicates my much-attacked
] LFP '92 paper. As it turns out, everybody seems to be right---I'm right
] for most architectures, but the "GC has good cache locality" camp (e.g.,
] Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
] you have a NICE cache.)

This is the "understanding" problem: if the cache can notice that reading the
line is not necessary when allocating a new object, it means the cache
understands better the way the GC-allocator works. Your point about the memory
bandwidth of the write-backs could be argued against in saying that there
should be a way for the cache to notice that most of the write-backs aren't
necessary (I'm assuming here that the generations' sizes have been carefully
chosen so that they "fit the memory hierarchy" (whatever this really means in
terms of relative sizes): in such a case most write-backs will be for
"old-space" objects that have been recognized as dead already). Since there
will still be write-backs of objects that *are* dead but haven't been
recognized as such yet, I guess manual allocators still have an edge. But it
might be balanced by the compacting effect of copying collectors.

For lack of hard data, this is probably blowing hot air, tho !
But the point still holds: it'd probably be a good idea to provide ways to
tell the memory hierarchy what's going on (like a "trash" instruction to
indicate that some memory region can be trashed (to free the old-space) or a
destructive read (can be used for linear-programming, or for stack-pops)).

] The bottom line, in my view, is that if you're targeting a range of
] machines and can't pick your cache design, you should be very careful
] about locality of reference during allocation. Allocate on the stack
] when you can do so safely, not on the heap---especially for very fast
] processors, where cache stalls start to be really important.

Too true: current systems are designed mainly to run stack-based programs, so
it's always safer to try to stick as much as possible to the same paradigm.
Appel's argument about GC and stack being just as fast (while the GC approach is
simpler since there *is* a GC anyway) is not convincing enough: the added
complexity buys you predictability (and it's a "one-time" complexity cost).


Stefan

Paul Wilson

Jul 28, 1995
In article <3vac07$p...@info.epfl.ch>,
Stefan Monnier <stefan....@epfl.ch> wrote:
>[comp.arch added since the thread is drifting towards cache design]

Then I should restate a couple of things: I'm very much pro-GC, for software
engineering reasons, but I think that it ultimately costs you a little
bit in locality, relative to a similarly good explicit deallocator
(malloc/free kind of thing).

Unfortunately, comparisons are quite difficult because the state of
allocator research and development is a bit of a mess. We don't know
much more than we did in 1971, which wasn't a whole lot.

This is important for computer architecture, because locality is
strongly affected by the way storage is managed. Programs reference
data objects, but data objects are mapped onto memory by the allocator
(and the memory hierarchy's own mechanisms).

Failure to realize this is one of the reasons we don't know much about
locality that we didn't know in 1971, either. (When was the last time
you saw a cache evaluation paper that even mentioned what implementation
of malloc() was used? Architects generally aren't aware of the potentially
large differences in locality for the same programs using different
allocators.)

>In article <3v8g7l$c...@jive.cs.utexas.edu>,
>Paul Wilson <wil...@cs.utexas.edu> wrote:
>] In article <950726164...@aruba.apple.com>,
>] Ken Dickey <Ke...@apple.com> wrote:
>] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
>] >>Another thing to note is that compacting GC's usually dramatically
>] >>reduce the overall working set of programs, which in turn reduces the
>] >>number of page faults a program might take. ...
>
>Is there hard data on this one?

Essentially none. Compaction is overrated, by the way. It can do as much
damage as good, if it's not done *right*. As far as comparisons between
conventional and GC'd systems go, there is essentially zero data on locality
effects, and there's every reason to think that locality effects are important.

The allocator literature is much worse than the GC literature in this respect.
Almost nothing is known about the behavior of allocators with real programs,
because the research mostly got off into a dead end of studying fragmentation
for synthetic programs, which are systematically and grossly unrealistic.

For more info on this, see the paper allocsrv.ps in our repository of papers
on memory management. (ftp.cs.utexas.edu, in the directory pub/garbage.)
This is a long survey and literature review on memory allocators. It makes
a few basic points about locality, as well as fragmentation and time costs.

>The only case where the GC was really able to reduce the working set size
>(at the VM level) is for the TI Explorer. I don't have the reference at
> hand,

My GC survey paper and memory management bibliography have citations for
this and some similar work. (Also available in our ftp repository.)

> [...]

>] I don't think this is quite right. GC usually costs you a little bit of
>] locality, relative to a well-designed explicit allocator. There's very
>
>I guess it depends on what you call locality. If you only consider the
>concept (far from hardware problems), I can't see why: the set of objects
>used is the same and I can't see why the layout of objects has to be worse
>for a GC than for a manual allocator.

The basic layout issues are the same, I think. A compacting GC has a lot
of freedom to reorganize data to improve locality or at least not to damage
it much. But a conventional allocator also has huge freedom in its initial
placement of objects, and this can dramatically affect locality.

(This has not been seriously studied, for no reason I can figure out---the
literature on conventional allocators is much less sophisticated about
locality than the literature on GC's. It may be due in part to the fact
that GC people learned about locality the hard way, because early GC's
often had quite horrible locality.)

> Of course, actual hardware generally behaves worse with a GC
>because it doesn't quite understand the way a GC works. The better the cache
>can "understand" the GC, the better the locality.

I think this is partly right, but only partly. GC's (except for reference
counting GC's) tend to do something that's intrinsically nasty for locality;
they pollute the cache with short-lived objects and can't *reuse* that memory
until a garbage collection detects that the memory is reusable. You can limit
the damage this causes, but you can't eliminate it entirely.

>] little data on locality of allocators, however. (Zorn's done some basic
>] stuff, but I don't regard it as conclusive.) We hope to fix that sometime
>] soon.
>
>I do hope so too, but it's not gonna be easy: there are so many variants of GC
>and so many variants of memory hierarchies...

Yes. On the other hand, I think one of the things that has limited progress
is a failure to focus on the fundamentals---namely program behavior and
what an allocator (or GC) can do to map that behavior nicely onto memory.
Architects tend to have a very naive view of locality, which is that it's
like a natural phenomenon that either is or isn't there. In fact, it's a
product of patterned program behavior interacting with patterned allocator
(or GC) strategies, and these must be studied in relative isolation to
figure out what's really going on. I believe that there are common and
simple patterns in program behavior that have not been noticed or exploited,
and that this is why there's been very little progress in the study of
allocators and the study of locality.

Our survey paper on allocators grinds this axe pretty hard---I think it's
an example of a very common failing in empirical computer science, which
is rushing to formalize complex things in simplistic ways, rather than to
focus on coming up with good *qualitative* models first. This occurs in
architecture and OS work as well as in allocator research. The current
understanding of locality is pretty pathetic---we still haven't gotten
much beyond LRU and Working Set and one-block-lookahead prefetching;
those are all very superficial.

(Recent work on compiler-directed prefetching and similar stuff for database
queries is progress of a sort, but there's a huge gap between these very
specific tricks and the very vague generalities (e.g., LRU), which has not
been explored well at all.)

>] (Appel's new (FPCA '95) paper deals with the issue of interactions between
>] cache design and GC design. It pretty much vindicates my much-attacked
>] LFP '92 paper. As it turns out, everybody seems to be right---I'm right
>] for most architectures, but the "GC has good cache locality" camp (e.g.,
>] Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
>] you have a NICE cache.)
>
>This is the "understanding" problem: if the cache can notice that reading the
>line is not necessary when allocating a new object, it means the cache
>understands better the way the GC-allocator works. Your point about the
>memory bandwidth of the write-backs could be argued against in saying that
>there should be a way for the cache to notice that most of the write-backs
>aren't necessary (I'm assuming here that the generations' sizes have been
>carefully chosen so that they "fit the memory hierarchy" (whatever this
>really means in terms of relative sizes): in such a case most write-backs
>will be for "old-space" objects that have been recognized as dead already).
>Since there will still be write-backs of objects that *are* dead but haven't
>been recognized as such yet, I guess manual allocators still have an edge.

Yes. That's my point. An allocator that knows when objects die always
has a locality advantage over one that waits for a GC to tell it. This
may not matter much for a particular GC in a particular memory hierarchy,
but explicit deallocation always has the edge. In practice, there's a
limit to how small you can make the youngest generation of a GC, so for
small, fast caches you will have a lot of misses. For larger caches,
you'll waste some space.

>But it might be balanced by the compacting effect
>of copying collectors.

I don't think so. For small caches, by the time you've compacted data it's
too late---you touched too much memory in between GC's, because you couldn't
reuse memory immediately.

The memory lost to fragmentation in conventional allocators appears to be
widely overestimated---at least for the best allocators. This is partly
because the allocator research area is such a mess that people use the
wrong allocators (e.g., next fit) or don't tweak them in the right ways
(e.g., getting rid of footer overhead for boundary tags).

Our recent results show good allocators having only about 10% fragmentation.
This is comparable to the losses you get just from a header word and rounding
object sizes up to a word or double word for architectural reasons.

(Some caveats are in order---this is for a set of eight varied C and C++
programs, but there are lots of kinds of programs that aren't represented
in *any* small set, and none of our programs runs for a very long time.)
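
To put that 10% in perspective, here's a back-of-the-envelope calculation
(my own, assuming a one-word header and rounding block sizes to a double
word):

  #include <cstdio>
  #include <cstddef>

  // Assumed layout: 4-byte header, blocks rounded up to a multiple of 8.
  std::size_t block_size(std::size_t request) {
      return (request + 4 + 7) & ~std::size_t(7);
  }

  int main() {
      std::size_t requests[] = { 12, 20, 40, 100 };
      for (int i = 0; i < 4; i++) {
          std::size_t r = requests[i];
          std::printf("request %3zu -> block %3zu (%.0f%% overhead)\n",
                      r, block_size(r),
                      100.0 * (block_size(r) - r) / r);
      }
      // e.g. 12 -> 16 (33%), 20 -> 24 (20%), 40 -> 48 (20%), 100 -> 104 (4%)
  }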

It's not clear to me that compaction is a win, on average. If objects
are placed correctly in the first place, it may not be necessary and it
may be *bad* for locality on average. The allocation order and sizes
of objects are likely to provide very strong hints as to which objects
will be used together. This info should be usable for decreasing
fragmentation and for increasing locality.

>For lack of hard data, this is probably blowing hot air, tho !
>But the point still holds: it'd probably be a good idea to provide ways to
>tell the memory hierarchy what's going on (like a "trash" instruction to
>indicate that some memory region can be trashed (to free the old-space) or a
>destructive read (can be used for linear-programming, or for stack-pops)).

I think cache-controlling instructions are a good idea, but I don't think
we're ready to say which ones are worthwhile. We haven't exploited the
potential of allocator design to improve locality for *all* architectures,
and we haven't figured out what cache hints are most useful for the most
programs.

>] The bottom line, in my view, is that if you're targeting a range of
>] machines and can't pick your cache design, you should be very careful
>] about locality of reference during allocation. Allocate on the stack
>] when you can do so safely, not on the heap---especially for very fast
>] processors, where cache stalls start to be really important.
>
>Too true: current systems are designed mainly to run stack-based programs, so
>it's always safer to try to stick as much as possible to the same paradigm.
>Appel's argument about GC and stack being just as fast (while the GC approach
>is simpler since there *is* a GC anyway) is not convincing enough: the added
>complexity buys you predictability (and it's a "one-time" complexity cost).
>

I think the difference between good GC'd systems and good explicit allocators
is often overstated, partly due to the tendency to see locality in
simplistic terms. GC'd and non-GC'd systems may use memory in pretty
similar ways, if both are well-designed.

Consider a FORTRAN program that uses large arrays (bigger than the cache)
repeatedly. Often, such programs go through arrays periodically, filling
them with new values and grinding on them for a little while.

Now consider a GC'd system, that romps through memory allocating mostly
short-lived objects, then GC's and does it again.

Notice that at a high level of abstraction, these seemingly different
programs are doing the same thing---they're reusing memory to hold
short-lived data. In the case of the FORTRAN array, it's the programmer
who decides that memory will be reused, by repeatedly using the same
language-level object to hold different data sets. In the case of a
GC'd system, it's the garbage collector that decides, by mapping different
language-level objects onto the same range of addresses over time.
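
To make the analogy concrete, here's a toy sketch of the two idioms in C++
(names and sizes are made up; a Lisp or ML system would allocate cons cells
or records rather than vectors, but the memory traffic pattern is the same):

  #include <vector>

  const int N = 1 << 20;                      // bigger than the cache

  // "FORTRAN style": the programmer reuses one big array for each data set.
  double grind_reuse(int passes) {
      std::vector<double> a(N);
      double s = 0;
      for (int p = 0; p < passes; p++) {
          for (int i = 0; i < N; i++) a[i] = p + i;   // refill in place
          for (int i = 0; i < N; i++) s += a[i];      // grind on it a while
      }
      return s;
  }

  // "GC style": each pass allocates a fresh batch of short-lived values;
  // the collector (or free list) later maps new objects onto old addresses.
  double grind_fresh(int passes) {
      double s = 0;
      for (int p = 0; p < passes; p++) {
          std::vector<double> a(N);                   // fresh heap block
          for (int i = 0; i < N; i++) a[i] = p + i;
          for (int i = 0; i < N; i++) s += a[i];
      }                                               // block dies here
      return s;
  }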

From the cache's point of view, however, it doesn't really matter. A
dumb cache will stall a lot, because it will miss frequently as areas
of memory are romped through.

A prefetching cache will do better, by fetching the data ahead of time.
(It will tend to be bandwidth-limited, however, if the processor is fast.)
A write-allocate cache with one-word subblock placement will do better,
because it will notice that everything is being written before being read
from, and optimize away the fetching of blocks---it will create them out of
thin air in the cache, rather than fetching data that will only be
overwritten immediately anyway. In all of these cases, there will be
a lot of write backs, and very fast processors will be bandwidth-limited.

A really good GC or really good FORTRAN compiler may be able to exploit
these patterns, if the cache is controllable, by noticing when things
become "dead" and telling the cache not to bother with most of the
write-backs. The situations where they'll be able to do this are different,
though.

To figure out what computer architects ought to do, we need to have a
much better understanding of how programming idioms interact with compilers'
and allocators' mappings of objects onto memory.

As an allocator and GC designer, it seems to me that the best bet is to
figure out what's really going on, exploit what I can by controlling object
placement, and hope that it turns out that the same features that are
useful to me are also useful to C and FORTRAN people, because that's where
a lot of the market is. (I'm pretty optimistic, though, because I suspect
a lot of the same behaviors show up in different kinds of systems, just
in different guises.) In the short term, I think it's important to optimize
for the kinds of cache designs that are or soon will be widely used---there's
plenty of worthwhile work to be done there.

Frank Adrian

Jul 28, 1995
Paul Wilson (wil...@cs.utexas.edu) wrote:
: To figure out what computer architects ought to do, we need to have a
: much better understanding of how programming idioms interact with compilers'
                                               ^^^^^^
: and allocators' mappings of objects onto memory.

I must say that I agree with your statement, although you misspelled the
seventh word on the second line of this paragraph. The "m" should be a
"t" instead.

I also apologize for injecting this note of levity into what was an
extremely well-written (and quite accurate) post. Although, when I
read this paragraph, I just couldn't resist :-).
___________________________________________________________________________
Frank A. Adrian ancar technology Object Analysis,
fra...@europa.com PO Box 1624 Design,
Portland, OR 97207 Implementation,
Voice: (503) 281-0724 and Training...
FAX: (503) 335-8976


Henry Baker

Jul 28, 1995
In article <3vac07$p...@info.epfl.ch>, "Stefan Monnier"
<stefan....@epfl.ch> wrote:

> Too true: current systems are designed mainly to run stack-based programs, so
> it's always safer to try to stick as much as possible to the same paradigm.
> Appel's argument about GC and stack being just as fast (while the GC approach
> is simpler since there *is* a GC anyway) is not convincing enough: the added
> complexity buys you predictability (and it's a "one-time" complexity cost).
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^

Oh, but it's _not_ a "one-time" complexity cost! Just look at the thousands
of papers over the past 25 years about how to force everything into the
Algol/PLI/Pascal/C/C++ "everything has to be stack-based" model.

Somehow, we keep paying, and paying, and paying for this 'one-time' cost.

---------

[For those of you who will be confused by this posting because of my
previous postings advocating stacks, you should know that the stacks
I was advocating are quite a bit different from the Algol/PLI/Pascal/C/C++
model. If you _have_ to program in one of these dinosaurs, you should
definitely consider using the "Cheney on the MTA"/pushy stack buffer
approach: ftp://ftp.netcom.com/pub/hb/hbaker/CheneyMTA.html (also .ps.Z).]

--
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html

Paul Wilson

Jul 28, 1995
In article <3vbl70$b...@fido.asd.sgi.com>,
Mike McDonald <mik...@engr.sgi.com> wrote:
> I should preface my comments with a statement stating I know next to
>nothing about modern allocators and GCs. The last time I studied any GC in
>detail was for a graduate compiler construction class at UIUC in 83/84.
>After I presented my report on Henry Baker's generational (I think) GC, the
>professor stated that GC had nothing to do with compilers.

Your professor was a bozo. Unless he was somebody important, of course :-)

>I wimped out and let it slide instead of challenging him.

Probably a good strategy under the circumstances, given that he was a bozo. :-)

Seriously, though, I think this is a major problem. CS is too
compartmentalized, and there are "compiler courses" most places which give
very short shrift to runtime systems. (They also tend to focus on the
best-understood and hence most boring parts of compiling---e.g., spending
the first 2/3 of the course on parsing, most of the rest on intermediate
code generation and assembly, and the last day on optimization. Not all
compiler courses, of course, but many, especially those taught at lesser
schools by older faculty. Maybe it's changing, but not fast enough for my
tastes.)

>That said,
>In article <3vb382$d...@jive.cs.utexas.edu>, wil...@cs.utexas.edu (Paul
>Wilson) writes:
>|> In article <3vac07$p...@info.epfl.ch>,
>|> Stefan Monnier <stefan....@epfl.ch> wrote:
>|> >[comp.arch added since the thread is drifting towards cache design]
>|> [ ... ]
>
> [...]
>
>|> It's not clear to me that compaction is a win, on average. If
>|> objects are placed correctly in the first place, it may not be
>|> necessary and it may be *bad* for locality on average. The
>|> allocation order and sizes of objects are likely to provide very
>|> strong hints as to which objects will be used together. This info
>|> should be usable for decreasing fragmentation and for increasing locality.
>
> I don't think allocation order and size gets you very far. First,
>allocation order tells you nothing about usage order.

First, let me say that I oversimplified a bit. The exact order isn't
what I'm mostly talking about, and isn't usually the most important for
memory hierarchies.

Also, in the best of all possible worlds, more information and more freedom
to use it are always better. In the real world as it is, I think a lot
of the obvious information available to allocators goes unused, and that
allocators could be improved a lot. Ultimately, funky memory hierarchy
designs and snazzy adaptive reorganization of the data may be a win,
but we're not exploiting the hardware and software structures we've got now.

It's also not clear that reorganizing things in an objectwise way in a
GC is the right way to do things. It may well be that the right way
of doing things is mostly independent of whether you have a GC or
an explicitly managed (malloc/free) heap.

>The order of allocation is very dependent on the algorithms used and the
>style of the writer.

To some degree yes, and to some degree no. Many differences don't really
matter much, and many similar behaviors arise in very different systems,
though maybe in different combinations and different proportions. We
need to understand those things before we decide what the appropriate way
of exploiting them is.

(Our allocator survey grinds this axe pretty hard, too. I talk explicitly
about a strategy/policy/mechanism distinction, as opposed to the usual
policy/mechanism distinction. I think that's important, because the
strategies are not understood very well, much less the policies that
are supposed to implement the strategies. The strategies should be
chosen on the basis of a deep understanding of regularities and irregularities
in workloads, which is almost completely absent in the literature on both
allocators and locality of reference. I probably should have made a four-way
distinction: theory/strategy/policy/mechanism, where I mean "theory" in
the vernacular sense of "I think this is basically what's going on", not
in the CS "theory" sense of "I have a cute formalization that may or
may not have anything to do with reality.")

To be more concrete, I think that phase behavior is quite important to
locality. Objects created at *roughly* the same time are likely to be
used for similar purposes, but maybe not if they are of different *types*.
If they are of different types, they are likely to be used for different
purposes---e.g., some may hold intermediate values of a computation, while
others hold results that outlive the phase that creates them.

Assuming we don't know the types of objects being allocated, as conventional
allocators don't, it's reasonable to use size heuristically to discriminate
between different types of objects created at about the same time.
It's not as good as real type information, but it'll probably get most of
the same benefit. (In my personal opinion, that is. Soon I may have data
that say whether I'm right.)

> Changing the order of a couple of C lines can drastically change the order
> of allocation.

Quite true. On the other hand, it may not matter. For example, a small
change in code may *reverse* the order in which a list is created or
traversed, but that may not matter much for locality. As long as the
items in the list are clustered together in roughly sequential order,
you'll get a ton of spatial locality. If they're interspersed with
other kinds of objects created by different phases, squirreled away here
and there in holes in the heap, you're likely to damage locality.

See our allocator survey for more on this, and profiles of phase behavior
of real programs that suggest that this view is reasonable.

>an example, let's say we want to build a tree structure. How many orderings
>can you come up with to build that tree? To keep it simple, let's just say
>depth first vs breadth first. These two algorithms give completely different
>allocation orders but result in the same logical structure.

Up to a point, I agree. On the other hand, there are several arguments
(and even a little data) that go the other way. Breadth-first and depth-first
actually both have some nice locality properties, and either may result in
pretty good clustering IF you keep the tree nodes belonging to a given tree
together in memory, rather than interspersing them with other data. You
may want to segregate the data indexed by the tree from the
nodes of the tree itself (segregation by size will usually do this).

> Allocation order gives us the wrong information. What we really want is
>access order. In this example, the algorithm used to walk the tree is far
>more important.

Yes and no. If the tree is reasonably well localized---by any reasonable
heuristic---you may get most of the benefit of an order specialized for
a particular traversal. For a small tree, this may mean the whole tree
fits in fast memory, where interspersing different kinds of data would
increase its "footprint". For a larger tree, you want something that
clusters the upper nodes together, mostly, and clusters the data it
indexes reasonably well, too. In this case, you'll do well for lots
of different access patterns. Sparse searches of the tree will take
a minimal number of faults, and exhaustive traversals of the tree will
do okay with only a few pages of memory.

A lot of this can be controlled in software, by the choice of the tree
structure, rather than by a GC or fancy hardware. For example, in the
Texas persistent store for C++, we use B+ trees rather than binary trees.
A B+ tree can be viewed as a binary tree where most of the levels have
been collapsed to cluster strongly-related nodes within a page. I think
a lot of data structure work is just optimizing the wrong thing---locality
is increasingly important, and we should be optimizing for that as well
as raw CPU speed.
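
To give a feel for the layout difference (assumed 4K pages and 64-bit fields;
this is a sketch, not the actual Texas data structures): a binary node buys
you a few dozen useful bytes per pointer chase, while a B+ tree node fills a
page with keys a search will actually consult.

  #include <cstdint>

  // One binary-tree node: ~32 bytes of payload per pointer chase.  Each
  // probe may fault in a whole 4K page to use a few dozen bytes of it.
  struct BinNode {
      std::int64_t key;
      BinNode *left, *right;
      void *item;
  };

  // One B+ tree interior node sized to fill a page: one fault brings in a
  // couple hundred keys, collapsing many binary levels into one page access.
  const int kPageBytes = 4096;                       // assumed page size
  const int kFanout = (kPageBytes - 16) / (sizeof(std::int64_t) + sizeof(void *));

  struct BPlusNode {
      std::int32_t count;
      std::int32_t is_leaf;
      std::int64_t keys[kFanout];
      void *children[kFanout];                       // child pages or items
  };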

The real booger for this theory is trees that are created in random order.
I don't think that happens very often, though, or at least it won't if
people are careful to code their programs right.

For one thing, most data simply aren't unpatterned. The patterns may be
unknown, but they're NOT random. Many seemingly different kinds of patterns
have similar locality implications, as long as they're not truly random.
(Yet another axe we grind in the allocator survey.)

>After all, you only allocate something once. You may access it many times.

Yes, but I claim (without proof) that the phase structure of a program
usually gives strong hints as to later access patterns---maybe not the
exact order of accesses, but enough information to cluster objects reasonably
and generate spatial locality.

> Size doesn't give us much useful info either. Just because two
>allocation requests ask for the same amount of memory doesn't mean they
>are the same structure nor that they'll be accessed together.

True, but see the above. Size alone isn't enough, but size and ordering
together can tell you a lot.

>If you get two requests for 32 bytes right in a row, what does that tell
>you? I'd say darn little.

I'd bet big money you're wrong. :-) Of course, that's unfair---I've been
researching this kind of thing for years, and you're raising very good
questions. (In some sense, I am betting big money on this.)

>That 32 byte object could be a string of length 32, an array of 8 pointers,
>or a 3d vector (x,y,z,w) of doubles. The other 32 byte object could be
>something completely different.

It *could* be, but it usually isn't, and that's good enough for real
systems. The majority of the time, two consecutive requests for the
same size *are* to allocate the same type of object, and for a similar
*purpose*, because of the phase structure in most programs.

>
> I'd think a much more useful piece of information that the compiler
>could provide is which structures are going to point to this piece of
>memory the allocator is about to allocate.

I don't think so. Compilers are good at very local analyses, but they're
usually pretty bad at the kinds of things that matter to locality.
(An exception to this is compiler-directed prefetching.) A compiler
may be able to easily look 100 CPU cycles into the future, but less
often 1000, and seldom a million. So you may be able to prefetch the
next few things you need, but the compiler generally won't be able to
tell you how to cluster things for good locality at larger granularities.
(Another exception is simple algorithms over very large arrays. Compilers
for scientific codes can get that right sometimes.)

> After all, you can't access something until you access the location of
> the pointer to that memory.

Yes, but by then it's too late. This is the locality problem of heap-allocated
data---because of the arbitrary indirections, looking a few instructions
ahead can only tell you what happens a few instructions ahead, not the 1,000
or 10,000 you'd really like---or the millions you'd like in the case of
virtual memory systems, to do a disk seek ahead of time. For virtual
memories, there's generally not enough "overlap" to do any good, which is
why virtual memory systems typically only do sequential demand prefetching,
i.e., wait until a fault and grab the next thing, too.

Clustering is the flip side of prefetching, and often much easier to
implement on dumb hardware.

Clustering is where it's at, in my view---you want the right data grouped
together so that a dumb fetch of a block or page will grab a lot of useful
data. It doesn't have to be data that you'll touch in the next 100
instruction cycles. If you use it within 10,000 instruction cycles, it
will usually be good. (For virtual memories, if you use it within the
next *billion* instruction cycles it will usually be good; compilers
are very bad at dealing with patterns on that scale, but runtime systems
can be very good at it.)

The key thing here is that (pre)fetching doesn't have to make *good*
predictions of the exact order of references to do an excellent job.
If it makes *passable* (significantly better than random) predictions,
it often works excellently.

> The type of the object being allocated would also be nice.

Yes. I agree completely---but I think that most systems don't take
advantage of the information that's staring them in the face. Order
and size can be exploited much more than current systems do, and there
are other sources of information that don't require fancy system structures
or specialized hardware.

>Then the compiler could build up some usage tables based upon static
>analysis.

You have to be very careful about this. This is a trap a lot of people
fall into. Just counting the frequencies of usage is often misleading.
It's the ways in which things are used---particularly large-scale
patterns---that are most relevant to memory hierarchies.

This is one of the problems we get because of the overemphasis on
compilers and the almost complete deemphasis of runtime systems. Runtime
systems have a very different kind of information available, and for
some purposes, it's *better* information.

(It may sound like I'm arguing for dynamically adaptive runtime systems,
but I think that's a trap, too---it's often better to take advantage
of known regularities without literally being "adaptive" at runtime.
People often design "adaptive" algorithms that are wrong in subtle
ways, and don't know why they don't work. There are often simple
regularities that can be exploited without any fancy pattern recognition
and adaptation---see our allocator survey for an example.)

>The allocator could then use that info along with the "who points to me"
>info to decide where to place the objects relative to each other.

I think this is a reasonable idea, but not the easiest to implement
fruitfully. (It's been tried, without much success, although that's
often not an indicator of the true merit of an idea.)

> For instance, if I have a type A that includes pointers to objects of
>type B and type C and the compiler determines that every time I access an
>object of type A, there's an 80% chance that I'll access the B object and
>a 30% chance I'll access the C object, then when I allocate an object of
>type A, B, or C, I can do some interesting things. For instance, when I
>allocate an object of type A, I may want to reserve the space for the
>type B object too but not the type C object. Anytime later when I actually
>go to allocate the type B object, I can return the previously reserved
>space since I'm also given the pointer to the initial A object.

This is a very appealing and seductive idea, but I think it's wrong.
(Many researchers in locality have been seduced by this kind of reasoning,
so don't feel bad :-).)

You have to take scale into account. In locality, it usually doesn't
matter how many times you do things in the very short term---what matters
is *whether* you do things at all on a somewhat longer timescale. (E.g.,
thousands, millions, or billions of instruction cycles.)

This is analogous to the issue of frequency-based vs. recency-based
replacement. A whole lot of people find frequency-based replacement
very appealing, but it seldom works as well as simple LRU replacement.

LFU replacement evicts the least frequently used item, where LRU replacement
evicts the least recently used. LRU is much better, because it's
"timescale relative"---it only notices touches that matter on the
timescale that's relevant to the memory in question.

Consider the following memory access patterns for two different pages:

* * * * * * * * * * * *

****** ******* ******* ********

LRU wins on this because it will usually evict the BOTTOM page, and have
to fault it back into memory three times. LFU loses, because it usually
evicts the top page---which is touched fewer times---and has to fault
it back in 11 times.

The reason why LRU is (very roughly) right and why LFU is (usually) wrong
is that short-term access patterns don't matter nearly as much as medium-
and long-term patterns. When I fault a page or block into memory, it
doesn't matter whether I touch it twice, or 200 times in quick
succession---what matters is whether I fault on it at all. Subsequent
short-term access patterns are irrelevant because the item will be in
memory for a little while for any reasonable policy, and repeated short-term
touches just don't matter. (The last one may matter, because it affects
the LRU ordering and hence which block will be evicted first, but all
of the intervening short-term touches not only don't matter much, they
don't matter *at*all*.)

This illustrates the problem of expecting the compiler to figure out
locality for you---a compiler has a short-term view, and locality is
a "big picture" issue. The same kind of problem comes up when you
rely on the compiler to tell you how often certain things happen---the
compiler can tell you how often certain concrete events are likely
to happen, but it can't tell you which of those things *matter* on
the timescale that matters.

> Mike McDonald
> mik...@engr.sgi.com

Mike McDonald

Jul 28, 1995
I should preface my comments with a statement stating I know next to
nothing about modern allocators and GCs. The last time I studied any GC in
detail was for a graduate compiler construction class at UIUC in 83/84.
After I presented my report on Henry Baker's generational (I think) GC,
the professor stated that GC had nothing to do with compilers. I wimped
out and let it slide instead of challenging him. That said,

In article <3vb382$d...@jive.cs.utexas.edu>, wil...@cs.utexas.edu (Paul
Wilson) writes:

|> In article <3vac07$p...@info.epfl.ch>,
|> Stefan Monnier <stefan....@epfl.ch> wrote:
|> >[comp.arch added since the thread is drifting towards cache design]
|>
|> Then I should restate a couple of things: I'm very much pro-GC, for
|> software engineering reasons, but I think that it ultimately costs you
|> a little bit in locality, relative to a similarly good explicit
|> deallocator (malloc/free kind of thing).

 I too am in favor of GCs from a software engineering point of view.
I've spent too much of my life already worrying about malloc/free in both
my programs and in others.

|> It's not clear to me that compaction is a win, on average. If
|> objects are placed correctly in the first place, it may not be
|> necessary and it may be *bad* for locality on average. The
|> allocation order and sizes of objects are likely to provide very
|> strong hints as to which objects will be used together. This info
|> should be usable for decreasing fragmentation and for increasing
|> locality.

 I don't think allocation order and size gets you very far. First,
allocation order tells you nothing about usage order. The order of
allocation is very dependent on the algorithms used and the style of the
writer. Changing the order of a couple of C lines can drastically change
the order of allocation. As an example, let's say we want to build a tree
structure. How many orderings can you come up with to build that tree? To
keep it simple, let's just say depth first vs breadth first. These two
algorithms give completely different allocation orders but result in the
same logical structure. Allocation order gives us the wrong information.
What we really want is access order. In this example, the algorithm used
to walk the tree is far more important. After all, you only allocate
something once. You may access it many times.

 Size doesn't give us much useful info either. Just because two
allocation requests ask for the same amount of memory doesn't mean they
are the same structure nor that they'll be accessed together. If you get
two requests for 32 bytes right in a row, what does that tell you? I'd say
darn little. That 32 byte object could be a string of length 32, an array
of 8 pointers, or a 3d vector (x,y,z,w) of doubles. The other 32 byte
object could be something completely different.

 I'd think a much more useful piece of information that the compiler
could provide is which structures are going to point to this piece of
memory the allocator is about to allocate. After all, you can't access
something until you access the location of the pointer to that memory.

 The type of the object being allocated would also be nice. Then the
compiler could build up some usage tables based upon static analysis. The
allocator could then use that info along with the "who points to me" info
to decide where to place the objects relative to each other.

 For instance, if I have a type A that includes pointers to objects of
type B and type C and the compiler determines that every time I access an
object of type A, there's an 80% chance that I'll access the B object and
a 30% chance I'll access the C object, then when I allocate an object of
type A, B, or C, I can do some interesting things. For instance, when I
allocate an object of type A, I may want to reserve the space for the
type B object too but not the type C object. Anytime later when I actually
go to allocate the type B object, I can return the previously reserved
space since I'm also given the pointer to the initial A object.

Mike McDonald
mik...@engr.sgi.com

Paul Wilson

Jul 29, 1995
In article <3vc8kj$5...@fido.asd.sgi.com>,
Mike McDonald <mik...@engr.sgi.com> wrote:
>In article <3vbrrf$e...@jive.cs.utexas.edu>, wil...@cs.utexas.edu (Paul
>Wilson) writes:
>|> It may well be that the right way
>|> of doing things is mostly independent of whether you have a GC or
>|> an explicitly managed (malloc/free) heap.
>
> Are we talking about one implementation of malloc/free used for all
>programs or are we talking about customized malloc/free on a per program
>basis?

Mostly the former, but the latter as necessary, and I agree it's likely
to be necessary sometimes. I suspect it won't be necessary often, though,
because I think one allocator can combine strategies so that it works
well for a variety of different program behaviors.

>Trying to
>come up with a general purpose allocator/GC is a quite different problem
>than coming up with a methodology for generating "custom" allocators/GC
>that use program-specific knowledge. I'm gathering you're referring to the
>former.

Yes. Custom-generated allocators are probably a good idea for a minority
of programs, but I think we've only scratched the surface of what a
really good allocator can do for most programs. An intermediate strategy
is to have a covering set of two or three good allocators, all of which
usually do well, and at least one of which does really well for almost
any program. Then you can just link against different allocators and see
what works best for your application.

This is fairly common in the sense that some people pick between a very
fast allocator---e.g., Chris Kingsley's BSD 4.2 allocator---and a slower
but more space-efficient allocator. But that sort of brutal space/time
tradeoff just shouldn't be necessary. There are some fairly
simple techniques that should make almost any reasonable allocator
reasonably fast. With deferred coalescing, you can probably get most
of the speed advantage of a very simple allocator, and most of the
space advantage of a more sophisticated one.
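
Here's the kind of thing I mean by deferred coalescing, as a rough sketch
(my toy code, not a description of any particular allocator): recently freed
blocks go onto per-size-class "quick lists" and are handed back without any
coalescing; only blocks that fall through go to the slower, space-efficient
general allocator.

  #include <cstddef>
  #include <cstdlib>

  const int kNumBins = 64;                  // size classes, 8-byte granularity
  static void *quick[kNumBins];             // heads of singly linked quick lists

  static int bin_of(std::size_t n) {
      if (n < sizeof(void *)) n = sizeof(void *);   // room for the link word
      return (int)((n + 7) / 8);
  }

  void *fast_alloc(std::size_t n) {
      int b = bin_of(n);
      if (b < kNumBins && quick[b]) {       // fast path: reuse a recently freed
          void *p = quick[b];               // block of this size class; it is
          quick[b] = *(void **)p;           // probably still cache-warm
          return p;
      }
      return std::malloc((std::size_t)b * 8);   // slow path: general allocator
  }

  // Deferred coalescing: freeing just pushes the block onto its quick list.
  // (Caller passes the same size it allocated with.)  A real allocator would
  // occasionally flush these lists back to the general pool and coalesce
  // there, bounding the space cost.
  void fast_free(void *p, std::size_t n) {
      int b = bin_of(n);
      if (b < kNumBins) {
          *(void **)p = quick[b];
          quick[b] = p;
      } else {
          std::free(p);
      }
  }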

>I'm interested in the latter since people are always "comparing" GC to "hand
>optimized" allocation. (It's my experience that very few of the people
>who claim that they can't use a GC due to its high overhead compared to
>the "hand optimized" allocation actually do the optimization. They just
>use malloc instead.)

I agree that this is true in many cases. But some people really do use
hand-tweaked allocation (often bypassing malloc entirely); they use ad
hoc allocators they kruft up for a particular situation. Sometimes this
is a good idea, but I think that most of the time it's not---people
should just find a better general-purpose allocator and use that. This
is especially common in C++, where people go nuts overloading new and
making class-specific allocators for a single kind of object, which often
gets rid of internal fragmentation but *creates* external fragmentation.
Yuck.

(Not that overloading new is a bad idea in general---we do it in our
persistent store for C++ and in our C++ interface to our real-time GC.
But don't try it at home---it's for trained professionals on a closed
course :-), and it shouldn't be necessary just to get decent allocator
performance. Especially since it tends to conflict horribly with
better-motivated reasons for overloading new, such as persistence and
GC.)

Hand-krufted or specialized allocators (like obstacks) do very dangerous
things to the structure of programs, and can lead to *very* weird bugs.
(I've seen some class-specific "pool" allocators that don't work for
subclasses of that class because the sizes are different---aargh! There
goes your nice extensible program design...) If there were better general
allocators around---and appreciated---it wouldn't be so attractive to
monkey with the works.
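
Here's a boiled-down version of the kind of bug I mean (hypothetical class
names, and the pool is deliberately simplistic): the class-specific operator
new hands out fixed Node-sized cells, so allocating a larger subclass quietly
overruns its cell.

  #include <cstddef>

  struct Node {
      Node *next;
      int value;

      // Class-specific "pool" allocator: carves fixed sizeof(Node) cells.
      static void *operator new(std::size_t /*size -- ignored: the bug*/) {
          alignas(std::max_align_t) static char pool[4096];
          static std::size_t used = 0;
          void *p = pool + used;
          used += sizeof(Node);            // always sizeof(Node), never 'size'
          return p;                        // (exhaustion check omitted)
      }
      static void operator delete(void *, std::size_t) {}  // never reclaimed
  };

  struct FatNode : Node {                  // extend the "extensible" design...
      double extra[8];                     // ...and now the object > one cell
  };

  int main() {
      Node *a = new FatNode;               // still gets a Node-sized cell
      Node *b = new Node;                  // constructed inside a's 'extra'
      (void)a; (void)b;                    // array: silent corruption
  }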

(Aside to people who do this sort of thing: there ARE some pretty good
allocators out there, such as Doug Lea's segregated fits allocator, which
is FREE. If you don't like the allocator you've got, try a different
one. If it doesn't run on your system (e.g., Lea's runs on UNIXES),
PORT IT---it's not that hard. I hope that soon we'll have a new and
nifty allocator to give away, too, but probably not for a few months.)

>|> >The order of allocation is very dependant on the algoritms used and
>|> >the style of the writer.
>|>
>|> To some degree yes, and to some degree no. Many differences don't
>|> really matter much, and many similar behaviors arise in very different
>|> systems, though maybe in different combinations and different proportions.
>|> We need to understand those things before we decide what the appropriate
>|> way of exploiting them is.
>

> True. I'd also say the corollary "Similar algorithms can have vastly
>different allocation behaviors" is also true.

Yes, but I think a single allocator (or two or three at most) can cover
most of the important behaviors in real programs. I could be wrong. We'll
see.

But again, a lot of the time the differences don't make a difference. If
you can cluster related data---and not intersperse them with unrelated
data---you generally win. It doesn't matter as much what the exact
access patterns are, or which clusters are accessed more often, as long
as birds of a feather flock together. (That is, you can often win by
discriminating between things that are accessed differently, rather than
by predicting how they'll be accessed.) And I think that, by and large,
you can use phased creations of objects as a fair correlate of phased
accesses to those objects, and get a win there too.

This is the basic idea behind "zone" allocators, which allow the programmer
to specify which objects should be lumped together in some subset of memory,
and that works for many programs. On the other hand, I think that a smart
allocator should be able to do that discrimination pretty well, automatically.

(Some zone allocators have problems with external fragmentation, because
free memory can't be shared between zones. A smart allocator should be
able to mostly segregate unrelated objects, but still share large hunks
of free memory---e.g., whole free pages---between the different subheaps.)
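
A rough sketch of what I mean by sharing at the page level (my own toy
interface, not any particular zone allocator): each zone bump-allocates
within pages it grabs from one shared page pool, so unrelated objects stay
segregated on different pages without free memory being fenced off per zone
forever.

  #include <cstddef>
  #include <cstdlib>
  #include <vector>

  const std::size_t kPage = 4096;          // objects assumed smaller than this

  static std::vector<void *> free_pages;   // one shared pool of whole pages

  static void *get_page() {
      if (!free_pages.empty()) {
          void *p = free_pages.back();
          free_pages.pop_back();
          return p;
      }
      return std::malloc(kPage);
  }

  struct Zone {                            // e.g. one zone per phase or subsystem
      char *cur = nullptr;
      std::size_t left = 0;
      std::vector<void *> pages;           // pages owned while the zone lives

      void *alloc(std::size_t n) {
          n = (n + 7) & ~std::size_t(7);
          if (n > left) {                  // bump-allocate within one page
              cur = (char *)get_page();
              pages.push_back(cur);
              left = kPage;
          }
          void *p = cur;
          cur += n; left -= n;
          return p;
      }
      ~Zone() {                            // whole pages go back to the shared
          for (std::size_t i = 0; i < pages.size(); i++)   // pool, not to a
              free_pages.push_back(pages[i]);              // per-zone freelist
      }
  };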

>|> [...]


>|> To be more concrete, I think that phase behavior is quite important
>|> to locality. Objects created at *roughly* the same time are likely to
>|> be used for similar purposes, but maybe not if they are of different
>|> *types*.
>|> If they are of different types, they are likely to be used for
>|> different
>|> purposes---e.g., some may hold intermediate values of a computation,
>|> while
>|> others hold results that outlive the phase that creates them.
>

> I'm assuming by "phase behaviour" and "phase structure" you're meaning
>that programs tend to go thru distinct steps, such as the initialization
>phase, the update phase, ...

That's part of it, but those larger phases often have many smaller phases
with distinctive usage of just a few sizes of objects. Phase structure
occurs at a variety of scales, and exploiting small phases could be
a big win.

> (I've printed out your survey papers. I just
>haven't read them yet today.)

Given the page counts, you're forgiven.

>|> Assuming we don't know the types of objects being allocated, as
>|> conventional allocators don't, it's reasonable to use size
>|> heuristically to discriminate between different types of objects
>|> created at about the same time.
>|> It's not as good as real type information, but it'll probably get
>|> most of the same benefit. (In my personal opinion, that is. Soon
>|> I may have data that say whether I'm right.)
>

> Why don't the allocators have the type information? The compilers
>usually do.

It's mostly because of the dumb way that allocators are usually linked
with applications, which has some nice advantages. Type info IS pretty
easy to get, and may be useful. I'm just grinding the axe that current
allocators don't exploit (or don't *intentionally* exploit) a lot of
important information they're already getting. I think we can get most
of the big wins without changing that situation, but if need be, we
can use compiler cooperation.

On the other hand, there are other sources of info that seem to me to
be more promising, such as the call site and the stack height. These
are better clues as to what the program is up to. Barrett and Zorn
have done some work on this kind of thing, as have Demers, Boehm,
Hayes et al. (of Xerox PARC) in GC'd systems. I think these are good
ideas, but not really necessary and maybe a little premature, since we
haven't explored the information normal allocators are given, through
the normal interface---which I think is a gold mine.
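
For what it's worth, the call-site idea can be prototyped with almost no
compiler cooperation; here's a toy sketch (it leans on the GCC/Clang builtin
__builtin_return_address, and the subheap selection is just a placeholder)
that tags each request with the address of its caller so an allocator could
segregate by allocation site:

  #include <cstddef>
  #include <cstdlib>

  // Toy "segregate by call site" wrapper: the return address identifies the
  // allocation site, which is often a better clue to an object's role and
  // lifetime than its size alone (cf. Barrett and Zorn).
  void *site_alloc(std::size_t n) {
      void *site = __builtin_return_address(0);       // who is asking?
      std::size_t subheap = ((std::size_t)site >> 4) % 8;
      // A real allocator would serve the request from pages dedicated to
      // 'subheap'; this sketch just records the choice and falls through.
      (void)subheap;
      return std::malloc(n);
  }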

>I don't think the allocator needs to know the exact structure of the
>objects but knowing that this call is allocating the same type as that
>previous call seems like it would be useful and easy to implement. Just
>have the compiler assign a unique id to each type and pass that id along
>to the allocator.

Yes, this is easy. I just think it's unnecessary, because successive
allocations of the same size usually are allocations of the same type
of object, even if other phases of the program allocate different objects
of the same sizes. And if you're going to discriminate further, you may
want to use call site (or call chain) info instead.

>
>|> >an example, let's say we want to build a tree structure. How many
>|> >orderings can you come up with to build that tree? To keep it simple,
>|> >let's just say depth first vs breadth first. These two algorithms
>|> >give completely different allocation orders but result in the same
>|> >logical structure.
>|>
>|> Up to a point, I agree. On the other hand, there are several
>|> arguments (and even a little data) that go the other way.
>|> Breadth-first and depth-first actually both have some nice locality
>|> properties, and either may result in pretty good clustering IF you
>|> keep the tree nodes belonging to a given tree together in memory,
>|> rather than interspersing them with other data. You may want to
>|> segregate the data indexed by the tree separate from the
>|> nodes of the tree itself (segregation by size will usually do this).
>

> I guess I must write weird programs. Most of my data structures tend
>to be about the same size. After the compiler gets done rounding the
>sizes off to double word alignment, I'd guess that there's a large
>redundancy in the sizes.

I don't think so, from the data I've been looking at. (I don't have hard
numbers on this at the moment---or at least not in the right form yet.)
I'm pretty sure that consecutive objects of the same size are usually
of the same type, due to skews in size usage during small-scale phases.
(Have a look at some of the memory-use profiles in our allocator survey.)

Also, in my understanding, it's usually not the compiler that does the
rounding (although the compiler does padding)---doubleword alignment is
usually enforced by the allocator; at least some of the information is
usually passed along through the allocator interface, although padding
may filter out some of it.
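To illustrate what I mean, the rounding typically happens on the allocator's
side of the interface, something like this (the header layout and constants
are illustrative only):

  #include <stddef.h>
  #include <stdio.h>

  #define ALIGNMENT   8                 /* doubleword on many machines */
  #define HEADER_SIZE sizeof(size_t)    /* one size/status word */

  size_t block_size(size_t request) {
      size_t need = request + HEADER_SIZE;
      return (need + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1);
  }

  int main(void) {
      /* A struct the compiler padded to 12 bytes still shows up here as a
         12-byte request; the allocator rounds 12 + 8 up to 24. */
      printf("request 12 -> block %lu\n", (unsigned long)block_size(12));
      printf("request 16 -> block %lu\n", (unsigned long)block_size(16));
      return 0;
  }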

>|> > Allocation order gives us the wrong information. What we really
>|> > want is access order. In this example, the algorithm used to walk
>|> > the tree is far more important.
>|>
>|> Yes and no. If the tree is reasonably well localized---by any
>|> reasonable heuristic---you may get most of the benefit of an order
>|> specialized for a particular traversal. For a small tree, this may
>|> mean the whole tree fits in fast memory, where interspersing different
>|> kinds of data would increase its "footprint". For a larger tree,
>|> you want something that clusters the upper nodes together, mostly,
>|> and clusters the data it indexes reasonably well, too. In this case,
>|> you'll do well for lots of different access patterns. Sparse searches
>|> of the tree will take a minimal number of faults, and exhaustive
>|> traversals of the tree will do okay with only a few pages of memory.
>

> If you can identify which allocs belong to the nodes and which belong
>to the data attached to the node, and if while accessing the nodes of the
>tree you don't need to access the data items, then I'll agree that just
>about any way you walk the tree has about the same locality "niceness"
>as any other. I just don't believe that size and order give you enough
>information to make that fine of a separation.

Okay, I'll try to come up with some data about small-scale phase behavior
that's more convincing.

>|> A lot of this can be controlled in software, by the choice of the
>|> tree structure, rather than by a GC or fancy hardware. For example,
>|> in the Texas persistent store for C++, we use B+ trees rather than
>|> binary trees. A B+ tree can be viewed as a binary tree where most
>|> of the levels have been collapsed to cluster strongly-related nodes
>|> within a page. I think a lot of data structure work is just optimizing
>|> the wrong thing---locality is increasingly important, and we should
>|> be optimizing for that as well as raw CPU speed.
>

> I totally agree. Memory bandwidth between the various components is
>almost always a bottleneck. Except benchmarks, of course. Most of those
>fit in everyone's on-chip cache.


>
>|> The real booger for this theory is trees that are created in random
>|> order. I don't think that happens very often, though, or at least
>|> it won't if people are careful to code their programs right.
>

> Wait a minute! Now we're relying on the programmers to do something
>smart? If that were the case, they'd all be writing their own allocators
>by hand for each program instead of just using malloc. But they don't.
>They want something that's easy to use and that will do what they perceive
>as a reasonable job without them having to think about it.

I agree that this sounds like I'm talking out of both sides of my mouth,
which I am, to some degree. But I think data structure designers ultimately
have to take the fall for some things. There are some things that are just
easier to fix in the data structures than in the compiler and/or runtime
system. Too much data structure work is still targeting early 1960's
architectures---no VM, no cache.

It's time data structure and algorithm designers moved into the real world.
Libraries should provide alternative implementations of basic collection
types with different *locality* characteristics for large collections.
It's just way easier for a data structure designer to get this right in
the first place than for a compiler and/or runtime system and/or hardware
to recover information that was thrown away before a line of code got
written. In a sense, a b-tree is just a more natural structure for a
large hierarchical memory than a binary tree is. Of course, for mostly
in-memory use you need to pay more attention to optimizing searches
within a node than database people usually do.
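For a concrete picture of the difference (node sizes are illustrative, not
taken from the actual Texas code):

  #include <stdint.h>

  struct bin_node {                 /* a few words per pointer chase */
      int32_t key;
      struct bin_node *left, *right;
  };

  #define FANOUT 255
  struct btree_node {               /* hundreds of keys per pointer chase */
      int32_t nkeys;
      int32_t keys[FANOUT];
      struct btree_node *children[FANOUT + 1];
      /* roughly 2-3KB depending on pointer size, so it fits within a 4KB
         page: one fetch of this node replaces ~8 fetches of scattered
         binary nodes */
  };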

>|> For one thing, most data simply aren't unpatterned. The patterns may
>|> be unknown, but they're NOT random. Many seemingly different kinds of
>|> patterns have similar locality implications, as long as they're not truly
>|> random. (Yet another axe we grind in the allocator survey.)
>

> I think I agree with this. Now, can one method do a reasonable job for
>all of these patterns or do we need a different method for each pattern?

Almost the former, I think---we should be able to deal with most common
patterns pretty well in one allocator, but we may need a couple of backup
allocators that handle patterns that are uncommon but not ignorably
rare, and we may have to resort to customized allocators for really
ornery programs.

>I believe that later is true. If so, how do you tell what the pattern
>is before you encounter it? If you leave it all up to the allocator, it
>has to deduce the pattern from one call at a time and then adjust its
>behavior accordingly. But it's too late then since it has already allocated
>things. I think it needs hints before it starts.

Maybe yes, maybe no. I think that by and large, a fairly simple combination
of fairly simple strategies is likely to work nearly as well for most
programs as a custom strategy. The trick is to identify the best and most
robust strategies and combine them in a way such that they don't undermine
each other. This doesn't have to be explicitly adaptive---it may "just
work out" that one feature comes into play in a more important way in
some circumstances, and another in different circumstances. I have a
hunch (which you're quite reasonable to be skeptical of) that the good
strategies aren't generally diametrically opposed, and can be combined
reasonably well.

(Also, adaptive tricks may be helpful sometimes---sometimes
fragmentation is a cumulative problem due to small-scale phase behavior.
An allocator could conceivably notice the small-scale phases and limit
the problem before too many of the troublesome phases occur. I'm hoping
we don't have to do anything that hairy. Adaptive algorithms tend not
to be very robust, unless you *really* know what you're doing. It's
easy to adapt to one thing and shoot yourself in the foot with respect
to another.)

As an analogy, consider LRU replacement policies. LRU is a simple policy,
but one that "adapts" reasonably well to a very wide variety of memory
referencing patterns exhibited by real programs. You can beat it in
any number of special cases, but on average it's really pretty good.
(A lot better than anybody had a right to expect, given the simplistic
notions of locality that were around when it was proposed.) People
were quite skeptical that LRU was robust enough for real use, but it
turns out that there are some deep reasons why it's very robust.
(At least, compared to FIFO, RANDOM, LFU, and a bunch of other things
that were tried.)

I think that there will be allocator policies that are similarly robust,
and for some reasons that are similar at a very high level of abstraction:

1. Programs aren't out to get you; they're artifacts designed to
solve problems, not to foil a memory allocator. As long as you
don't do things that lose big, you'll usually do pretty well.
The bad cases don't appear to occur very often, if you don't
do something stupid and shoot yourself in the foot.

2. Programs exhibit locality in their allocation behavior, as well
as in their memory-referencing behavior. They are patterned in
ways that should be exploitable---we've already got allocators
that work pretty well for most programs, and there's no reason
to think that they're the best we can do. (The process that
resulted in the allocators we've got was not very scientific or
empirical. I have to guess that when we figure out what's really
going on, we can do at least a little bit better.)

3. In terms of locality of reference, at least, it's more important
not to lose big on any common patterns than it is to exploit
every bit of information to the fullest. This has to do with
the granularity issues discussed in my last posting, and is why
LRU almost always works better than LFU. You don't have to make
very precise predictions (and LRU doesn't) as long as you can
make reasonably good discriminations. (LRU does this by not
screwing up small-scale stuff: it NEVER evicts something that's
very recently touched; its mistakes tend to be in the longer-term
patterns, which occur much less often and therefore don't pose
as much of a danger, on average.)

I know this is all pretty vague; you'll have to wait if you want convincing
proof.
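But to make the LRU analogy concrete, here's a toy page-replacement
simulation; the reference string and frame count are arbitrary, and it's
only meant to show the "never evicts the most recently touched thing"
property:

  #include <stdio.h>

  #define FRAMES 3

  int main(void) {
      int refs[] = {1, 2, 3, 1, 4, 1, 2, 5, 1, 2};
      int n = (int)(sizeof refs / sizeof refs[0]);
      int frame[FRAMES];       /* resident pages */
      int stamp[FRAMES];       /* time of last touch for each frame */
      int faults = 0;

      for (int i = 0; i < FRAMES; i++) { frame[i] = -1; stamp[i] = -1; }

      for (int t = 0; t < n; t++) {
          int hit = -1;
          for (int i = 0; i < FRAMES; i++)
              if (frame[i] == refs[t]) hit = i;
          if (hit < 0) {                 /* fault: evict least recently used */
              int victim = 0;
              for (int i = 1; i < FRAMES; i++)
                  if (stamp[i] < stamp[victim]) victim = i;
              frame[victim] = refs[t];
              hit = victim;
              faults++;
          }
          stamp[hit] = t;    /* the victim is never the most recent touch */
      }
      printf("%d faults on %d references\n", faults, n);
      return 0;
  }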

>|> >After all, you only allocate something once. You may access it many
>|> >times.
>|>
>|> Yes, but I claim (without proof) that the phase structure of a
>|> program usually gives strong hints as to later access patterns---maybe
>|> not the exact order of accesses, but enough information to cluster
>|> objects reasonably and generate spatial locality.
>

> I guess I disagree here. From my experience, in a lot of systems,
>different phases were written by completely different people, sometimes
>completely different organizations. To assume that they'll have similar
>locality behaviour is too much for me.

I can only say that from the limited data I've seen, and my own intuitions,
I think that a fairly simple combination of strategies can exploit a
wide variety of patterns. As I said earlier, this is partly because
some "different" kinds of patterns can be exploited by a single simple
strategy, and because I don't think the good strategies will turn out
to be diametrically opposed.

>|> Clustering is where it's at, in my view---you want the right data
>|> grouped together so that a dumb fetch of a block or page will grab
>|> a lot of useful data. It doesn't have to be data that you'll touch
>|> in the next 100 instruction cycles. If you use it within 10,000
>|> instruction cycles, it will usually be good. (For virtual memories,
>|> if you use it within the next *billion* instruction cycles it will
>|> usually be good; compilers are very bad at dealing with patterns on
>|> that scale, but runtime systems can be very good at it.)
>

> Within the next 10,000 of your instruction cycles? Maybe. Within a
>billion? No way. Never.

I'm not sure you realize how much faster CPU's are than disks---most
people don't, and it's kind of amazing when you think about it.

Consider a 100 MIPS processor with 32 MB of RAM and a 10 msec average
disk access time (seek plus rotational latency plus transfer time).
This is what I've got on my desk. If I have 4KB pages and fetch one
page at a time at 10msec each, it takes a while to fault in a memory
full of pages and flush all of the previously resident pages. (For
simplicity, assume I do something nasty like straight-line accesses
of a huge amount of memory.) This is roughly the timescale relevant
to virtual memory replacement policies.

At 10msec per page, I can fetch 100 pages in a second. 32MB of 4KB
pages is about 8000 pages, or 80 seconds replacement time, *minimum*.
That is, a page will stay in memory for at least 80 seconds after
it is last touched, even if I'm touching a different nonresident
page at every memory reference, i.e., if I have no usable locality and
am thrashing my brains out.

With an 80-second replacement time, at 100 million instructions a
second, that's 8 billion instructions between the last time I touch
a page and the time it gets evicted.
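Here's that back-of-the-envelope arithmetic spelled out as a throwaway C
program, if you want to check it (the parameters are just the ones above):

  #include <stdio.h>

  int main(void) {
      double mips       = 100e6;                  /* instructions/second */
      double ram_bytes  = 32.0 * 1024 * 1024;
      double page_bytes = 4096.0;
      double fault_secs = 0.010;           /* seek + rotation + transfer */

      double pages        = ram_bytes / page_bytes;   /* ~8000 pages */
      double flush_secs   = pages * fault_secs;       /* ~80 seconds */
      double instructions = flush_secs * mips;        /* ~8 billion  */

      printf("pages: %.0f  min replacement time: %.0f s  "
             "instructions per eviction cycle: %.1e\n",
             pages, flush_secs, instructions);
      return 0;
  }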

Now, this is all oversimplified. If I'm really doing straight-line
accesses or effective prefetching, I may be limited only by my SCSI
bandwidth, and be able to get something close to 4MB/sec. In that
case, I'm under a billion instructions. And I may lose some
memory to nonpageable OS stuff, so maybe I only get half a billion.

On the other hand, the gap between CPU and disk speeds is widening;
there are processors that are several times faster than mine, but
few disks that are that many times faster than my disk. (RAIDs
help, bandwidth-wise, but if a significant fraction of my pageins
require seeks, I'm hosed.) And most of the time, there's some locality,
so you don't tend to flush everything as fast as possible. So billions
of instructions is about the right ballpark, for fast machines in the
next few years.

The bottom line is that for VM prefetching to be very effective at all,
I've got to make predictions at least a hundred thousand instructions
in advance. On the other hand, it's just fine to predict things many
millions of instructions in advance, because bringing them in early
won't waste memory for very long on the scale of VM replacement.

This implies that clustering is more attractive than true prefetching,
because if I can group together things that are touched at very roughly
the same time (e.g., within a few seconds of each other), I win. I
don't have to make very short-term predictions, and I don't have to
predict very precisely.

> All of those other programs running with bad locality are going to
>use that LRU algorithm to flush your data back out to disk.

Nope, the disk can't keep up, and the processes will just be blocked
most of the time.

>Heck, you'd probably flush them out yourself.

Again, the disk just won't go that fast---I can't flush pages any faster
than I can fault pages in.

>I'm assuming you are doing something in that billion cycles.

Not usually---these days I (and many people) usually only have one
dominant process that uses most of the memory, and if you're paging
much at all, you're thrashing. There are lots of machines that
are more time-shared, but they often have more RAM as well, and the
trend is toward fewer and fewer big processes per processor.

>|> The key thing here is that (pre)fetching doesn't have to make *good*
>|> predictions of the exact order of references to do an excellent job.
>|> If it makes *passable* (significantly better than random)
>|> predictions, it often works excellently.
>|>
>|> > The type of the object being allocated would also be nice.
>|>
>|> Yes. I agree completely---but I think that most systems don't take
>|> advantage of the information that's staring them in the face. Order
>|> and size can be exploited much more than current systems do, and
>|> there are other sources of information that don't require fancy system
>|> structures or specialized hardware.
>

> I don't think type does either. (See above.)

Agreed. Type info isn't really hard to get. I just don't think it's
much better than the info allocators are getting already; if it turns
out to be, I'll certainly be happy to use it.


>
>|> This is one of the problems we get because of the overemphasis on
>|> compilers and the almost complete deemphasis of runtime systems.
>|> Runtime systems have a very different kind of information available,
>|> and for some purposes, it's *better* information.
>

> True. But the problem with runtime info is that you've already
>committed yourself before the patterns are apparent. I think local, static
>analysis can help here. True, it doesn't help with global issues. But
>once again, some knowledge is often better than none. (It's also often worse.)
>

In principle, I think I agree. It's just that in practice, we should be
able to do a good job with the information at hand, once we understand
its significance. Later we can go looking for more information to get
further improvements. You have to be careful that gathering and
exploiting the information isn't more expensive than it's worth, which
is why I'm biased towards simple solutions--but based on sophisticated
knowledge about why they work. That's where the hard work comes in...

Henry Baker

unread,
Jul 31, 1995, 3:00:00 AM7/31/95
to
Paul Wilson (wil...@cs.utexas.edu) wrote:

> Then I should restate a couple of things: I'm very much pro-GC, for software
> engineering reasons, but I think that it ultimately costs you a little
> bit in locality, relative to a similarly good explicit deallocator
> (malloc/free kind of thing).

I think that you'd have to be very specific about what your assumptions
were in order to back this conclusion up. You'd have to include things
like how smart the compiler was about killing dead things promptly,
and what language you are writing in, and what help the language/type
system gives to the programmer to do 'explicit' deallocation.

For a large fraction of 'vanilla' C/C++ code, however, I would strongly
disagree. Most people's attempts at local 'optimizations' by doing
explicit deallocation will backfire, and lead to _both_ insecure code
_and_ lower performance. In most professionally constructed GC systems,
the GC architect has thought a _lot more_ about locality and performance
issues than most application writers, and is also in a much better position
to do something positive about locality than most application writers.

> Unfortunately, comparisons are quite difficult because the state of
> allocator research and development is a bit of a mess. We don't know
> much more than we did in 1971, which wasn't a whole lot.

Agreed.

> >Paul Wilson <wil...@cs.utexas.edu> wrote:
> >] Ken Dickey <Ke...@apple.com> wrote:
> >] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
> >] >>Another thing to note is that compacting GC's usually dramatically
> >] >>reduce the overall working set of programs, which in turn reduces the
> >] >>number of page faults a program might take. ...
> >
> >Is there hard data on this one ?
>
> Essentially none. Compaction is overrated, by the way. It can do as much
> damage as good, if it's not done *right*. As far as comparisons between
> conventional and GC'd systems go, there is essentially zero data on locality
> effects, and there's every reason to think that locality effects are
> important.

With a high quality VM system and log-structured file systems, I'm afraid
you are right. In these cases, the underlying VM has already gained most
of the advantage that compaction would offer. One could gain a bit more
memory utilization with compaction, but the cost of actually doing the
compaction may not pay for itself. With modern caches, I'm not sure if
there is any gain at all on that front.

IMHO the major gain from compaction is to defragment the address space,
but as Paul's recent work has shown, one may not have to do this very
often.

> For more info on this, see the paper allocsrv.ps in our repository of papers
> on memory management. (ftp.cs.utexas.edu, in the directory pub/garbage.)

Excellent paper, and highly recommended.

> GC's (except for reference
> counting GC's) tend to do something that's intrinsically nasty for locality;
> they pollute the cache with short-lived objects and can't *reuse* that memory
> until a garbage collection detects that the memory is reusable. You can limit
> the damage this causes, but you can't eliminate it entirely.

This is correct as far as the _address space_ is concerned, but completely
wrong as far as cache behavior is concerned. Reinhold's wonderful thesis

Reinhold, Mark B. "Cache Performance of Garbage-Collected Programming
Languages". MIT/LCS/TR-581, Sept. 1993. (Also PLDI'94??)

shows how a gc makes _excellent_ use out of a write-allocate cache, and
that short-lived objects cause no problems at all, because _they live
and die in the cache and need never cause any memory traffic at all_.

So the problem is bad HW architectures, not GC, per se. (When, oh when,
will VM architects give us 'write-allocate' VM? CS people have known
that it is important for 30 years, but somehow the OS people have never
gotten the word.)

> Yes. On the other hand, I think one of the things that has limited progress
> is a failure to focus on the fundamentals---namely program behavior and
> what an allocator (or GC) can do to map that behavior nicely onto memory.
> Architects tend to have a very naive view of locality, which is that it's
> like a natural phenomenon that either is or isn't there. In fact, it's a
> product of patterned program behavior interacting with patterned allocator
> (or GC) strategies, and these must be studied in relative isolation to
> figure out what's really going on. I believe that there are common and
> simple patterns in program behavior that have not been noticed or exploited,
> and that this is why there's been very little progress in the study of
> allocators and the study of locality.

Excellent point. Unfortunately, HW architects don't have a lot of
control over what SW people run on their machines. It would be a
shame if HW architects go to a lot of trouble, however, to give people
something that can be much more effectively accomplished with better
SW.

> >This is the "understanding" problem: if the cache can notice that reading the
> >line is not necessary when allocating a new object, it means the cache
> >understands better the way the GC-allocator works. Your point about the
> >memory bandwidth of the write-backs could be argued against in saying that
> >there should be a way for the cache to notice that most of the write-backs
> >aren't necessary (I'm assuming here that the generations' sizes have been
> >carefully chosen so that they "fit the memory hierarchy" (whatever this
> >really means in terms of relative sizes): in such a case most write-backs
> >will be for "old-space" objects that have been recognized as dead already).
> >Since there will still be write-backs of objects that *are* dead but haven't
> >been recognized as such yet, I guess manual allocators still have an edge.

Manual allocators have exactly the same trouble that GC's do if they
have no way to tell the HW that a particular cache line is now dead.
It's way past time for HW to have some ability for the SW to inform it
that things are dead. I have argued elsewhere for things like a
"last load" instruction, which informs that HW that after supplying
the contents of this memory location, it is now dead, and need not
be written back. In conjunction with a write-allocate cache, much
memory traffic can be eliminated by the compiler.

> I don't think so. For small caches, by the time you've compacted data it's
> too late---you touched too much memory in between GC's, because you couldn't
> reuse memory immediately.

Wrong intuition. Worry more about cache behavior. See Reinhold's thesis again.

> >For lack of hard data, this is probably blowing hot air, tho !
> >But the point still holds: it'd probably be a good idea to provide ways to
> >tell the memory hierarchy what's going on (like a "trash" instruction to
> >indicate that some memory region can be trashed (to free the old-space) or a
> >destructive read (can be used for linear-programming, or for stack-pops)).

^^^^^^^^^^^^^^^^^^

I think you mean 'linear logic style of programming' here!

Henry Baker

unread,
Jul 31, 1995, 3:00:00 AM7/31/95
to
In article <3vbl70$b...@fido.asd.sgi.com>, mik...@engr.sgi.com (Mike
McDonald) wrote:

> I should preface my comments with a statement stating I know next to
> nothing about modern allocators and GCs. The last time I studied any GC
> in detail was for a graduate compiler construction class at UIUC in
> 83/84. After I presented my report on Henry Baker's generational (I
> think) GC, the professor stated that GC had nothing to do with
> compilers. I wimped out and let it slide instead of challenging him.

If you are referring to my April 1978 CACM paper, it does _not_ describe
what people would now call a 'generational' GC, but a straightforward copying
GC which is scheduled in a peculiar way.

[There seems to be a lot of confusion about this. Coplien's "Advanced
C++ Programming Styles and Idioms", Addison-Wesley, 1992, is also very
confused about this. He describes my 1978 GC as a "mark-sweep" gc, which
characterization most GC people would disagree with.]

If anyone is interested, this paper is available online at

ftp://ftp.netcom.com/pub/hb/hbaker/RealTimeGC.html (also .ps.Z).

> I too am in favor of GCs from a software engineering point of view.
> I've spent too much of my life already worrying about malloc/free in
> both my programs and in others.

What, you don't work for 'Purify'?? ;-)

According to Jon Bentley and Bjarne Stroustrup, rewriting malloc/free
is one of the most rewarding forms of entertainment for SW people. ;-)

> I don't think allocation order and size gets you very far. First,
> allocation order tells you nothing about usage order. The order of
> allocation is very dependent on the algorithms used and the style of
> the writer. Changing the order of a couple of C lines can drastically
> change the order of allocation.

Exactly!! Which is why many 'studies' aren't worth the paper they are
printed on.

-----

My own pet peeve is _weak_ language type systems like Ada and C++ that
_force_ you to break apart a single object into a number of smaller subobjects
_that are separately mallocated_ even when you know how to manage the
object as one object in assembly language. Stroustrup's C++ is especially
'pointer-happy', as a perusal of the following two books will show:

Stroustrup, Bjarne. "The C++ Programming Language, Second Edition". Addison
Wesley, 1991.

Ellis, M.A. and Stroustrup, B. "The Annotated C++ Reference Manual".
Addison Wesley, 1990.

Whenever things get tough, C++ wimps out and forces another pointer indirection.

Of course, this 'solves' the problem for the C++ implementor, but throws
a much larger problem over the fence onto the poor application writer
and/or memory manager, who is now forced to deal with a much higher load
of smaller objects.

Whoever said that C++ doesn't force you to pay for features you don't use
hasn't ever programmed in C++ to any extent.

Henry Baker

unread,
Jul 31, 1995, 3:00:00 AM7/31/95
to
In article <3vc8kj$5...@fido.asd.sgi.com>, mik...@engr.sgi.com (Mike
McDonald) wrote:

> Yep. According to this professor, compiler construction is only about
> parsing
> with error recovery. Maybe a little pinhole optimization too.
^^^^^^^

Make that 'pinhead' optimization [something that Zippy would do, e.g.].

Mike McDonald

unread,
Jul 31, 1995, 3:00:00 AM7/31/95
to
In article <3vcne8$e...@jive.cs.utexas.edu>, wil...@cs.utexas.edu (Paul Wilson) writes:

|> Yes. Custom-generated allocators are probably a good idea for a minority
|> of programs, but I think we've only scratched the surface of what a
|> really good allocator can do for most programs. An intermediate strategy
|> is to have a covering set of two or three good allocators, all of which
|> usually do well, and at least one of which does really well for almost
|> any program. Then you can just link against different allocators and see
|> what works best for your application.

In my experience, the system usually only comes with one. If you want to try
something different, you're on your own. That tends to lead people to build their
own from scratch rather than experiment with using already proven ones. I think
this is a pretty sad state of affairs.


|> >I'm interested in the latter since people are always "comparing" GC to "hand
|> >optimized" allocation. (It's my experience that very few of the people
|> >who claim that they can't use a GC due to its high overhead compared to
|> >the "hand optimized" allocation actually do the optimization. They just
|> >use malloc instead.)
|>
|> I agree that this is true in many cases. But some people really do use
|> hand-tweaked allocation (often bypassing malloc entirely); they use ad
|> hoc allocators they kruft up for a particular situation. Sometimes this
|> is a good idea, but I think that most of the time it's not---people
|> should just find a better general-purpose allocator and use that. This
|> is especially common in C++, where people go nuts overloading new and
|> making class-specific allocators for a single kind of object, which often
|> gets rid of internal fragmentation but *creates* external fragmentation.
|> Yuck.

A good number of the "hand optimized" allocators I've seen people create
actually perform worse than the standard malloc that came with their system. I
think a lot of this is due to the fact that we all think we know how our code
behaves but we've never bothered to verify it. As a result, we build upon
assumptions that aren't true---as your paper points out in the case of random
traces.

|> Hand-krufted or specialized allocators (like obstacks) do very dangerous
|> things to the structure of programs, and can lead to *very* weird bugs.
|> (I've seen some class-specific "pool" allocators that don't work for
|> subclasses of that class because the sizes are different---aargh! There
|> goes your nice extensible program design...) If there were better general
|> allocators around---and appreciated---it wouldn't be so attractive to
|> monkey with the works.

This was when I got called in. They'd built a kludgy system that was unreliable
and they needed someone to "fix" the bug. They usually didn't like the answer
that their whole system was the "bug". :-)

|> (Aside to people who do this sort of thing: there ARE some pretty good
|> allocators out there, such as Doug Lea's segregated fits allocator, which
|> is FREE. If you don't like the allocator you've got, try a different
|> one. If it doesn't run on your system (e.g., Lea's runs on UNIXES),
|> PORT IT---it's not that hard. I hope that soon we'll have a new and
|> nifty allocator to give away, too, but probably not for a few months.)

The state of affairs in software "engineering" just plain sucks! (I can't think
of any nicer way to put it.) People go off and study basic problems, spending
lots of time and effort characterizing the problem and coming up with
solutions---and to what effect? Does anyone use their results? Has anyone heard
of their work? Nope, we all just reinvent everything from scratch every time.
Pitiful.

|> > True. I'd also say the corollary "Similar algorithms can have vastly
|> >different allocation behaviors" is also true.
|>
|> Yes, but I think a single allocator (or two or three at most) can cover
|> most of the important behaviors in real programs. I could be wrong. We'll
|> see.

I'll agree to that for now.

|> > (I've printed out your survey papers. I just
|> >haven't read them yet today.)
|>
|> Given the page counts, you're forgiven.

I've just about finished reading it. (Except for pages 48 through 56. The bottom
half of each page didn't get printed for some reason. I'll have to reprint them.)

I'm a bit confused by one thing. In the beginning of the paper, you state that
reducing fragmentation is the most important thing for an allocator to do. In this
thread, I get the impression that you now think locality is more important. Care
to elaborate?

|> It's mostly because of the dumb way that allocators are usually linked
|> with applications, which has some nice advantages. Type info IS pretty
|> easy to get, and may be useful. I'm just grinding the axe that current
|> allocators don't exploit (or don't *intentionally* exploit) a lot of
|> important information they're already getting. I think we can get most
|> of the big wins without changing that situation, but if need be, we
|> can use compiler cooperation.

I'll agree that we're not using the info we have to the fullest. But if type
info is easy to get, why not see if it's helpful while we're at it? [I know. One
has to sleep once in a while too. :-)]

|> On the other hand, there are other sources of info that seem to me to
|> be more promising, such as the call site and the stack height. These
|> are better clues as to what the program is up to. Barrett and Zorn
|> have done some work on this kind of thing, as have Demers, Boehm,
|> Hayes et al. (of Xerox PARC) in GC'd systems. I think these are good
|> ideas, but not really necessary and maybe a little premature, since we
|> haven't explored the information normal allocators are given, through
|> the normal interface---which I think is a gold mine.

I found these ideas intriguing! I take it the point of these is that if
the allocator is called from the same location on two different occasions, it's
probably allocating the same type of object (or a "closely" related one). (Your
paper never explicitly states this.)


|> Yes, this is easy. I just think it's unnecessary, because successive
|> allocations of the same size usually are allocations of the same type
|> of object, even if other phases of the program allocate different objects
|> of the same sizes. And if you're going to discriminate further, you may
|> want to use call site (or call chain) info instead.

But two successive calls of different sizes may also be of the same type. A call
to allocate one foo looks quite different to malloc than a call to allocate
five foos. But the types are the same, with, I'd guess, similar death times.


|> Also, in my understanding, it's usually not the compiler that does the
|> rounding (although the compiler does padding)---doubleword alignment is
|> usually enforced by the allocator; at least some of the information is
|> usually passed along through the allocator interface, although padding
|> may filter out some of it.

Who requires the alignment varies. The allocator may enforce a larger alignment
policy in order to make its job easier. The compiler may enforce an alignment
policy for architectural reasons. Or it may be a combination of both. The
allocator had better enforce the compiler's and libraries' alignment at a minimum.


|> I agree that this sounds like I'm talking out of both sides of my mouth,
|> which I am, to some degree. But I think data structure designers ultimately
|> have to take the fall for some things. There are some things that are just
|> easier to fix in the data structures than in the compiler and/or runtime
|> system. Too much data structure work is still targeting early 1960's
|> architectures---no VM, no cache.
|>
|> It's time data structure and algorithm designers moved into the real world.
|> Libraries should provide alternative implementations of basic collection
|> types with different *locality* characteristics for large collections.
|> It's just way easier for a data structure designer to get this right in
|> the first place than for a compiler and/or runtime system and/or hardware
|> to recover information that was thrown away before a line of code got
|> written. In a sense, a b-tree is just a more natural structure for a
|> large hierarchical memory than a binary tree is. Of course, for mostly
|> in-memory use you need to pay more attention to optimizing searches
|> within a node than database people usually do.

|> I think that there will be allocator policies that are similarly robust,
|> and for some reasons that are similar at a very high level of abstraction:
|>
|> 1. Programs aren't out to get you; they're artifacts designed to
|> solve problems, not to foil a memory allocator. As long as you
|> don't do things that lose big, you'll usually do pretty well.
|> The bad cases don't appear to occur very often, if you don't
|> do something stupid and shoot yourself in the foot.

As your paper points out, we really don't know what the "good" things are to
exploit, nor do we know what the "bad" things are to avoid. We know a few bad
things and a few things that seem to work well, but we really don't know
why. Basically, we've been lucky.

|> 2. Programs exhibit locality in their allocation behavior, as well
|> as in their memory-referencing behavior. They are patterned in
|> ways that should be exploitable---we've already got allocators
|> that work pretty well for most programs, and there's no reason
|> to think that they're the best we can do. (The process that
|> resulted in the allocators we've got was not very scientific or
|> empirical. I have to guess that when we figure out what's really
|> going on, we can do at least a little bit better.)
|>
|> 3. In terms of locality of reference, at least, it's more important
|> not to lose big on any common patterns than it is to exploit
|> every bit of information to the fullest. This has to do with
|> the granularity issues discussed in my last posting, and is why
|> LRU almost always works better than LFU. You don't have to make
|> very precise predictions (and LRU doesn't) as long as you can
|> make reasonably good discriminations. (LRU does this by not
|> screwing up small-scale stuff: it NEVER evicts something that's
|> very recently touched; its mistakes tend to be in the longer-term
|> patterns, which occur much less often and therefore don't pose
|> as much of a danger, on average.)

Ah, the MinMax philosophy. Don't try to win, just try to keep from losing. (And
hope the other guy goofs up.) This philosophy never has appealed to me. I do
admit though that it usually does work.


|> >|> Clustering is where it's at, in my view---you want the right data
|> >|> grouped together so that a dumb fetch of a block or page will grab
|> >|> a lot of useful data. It doesn't have to be data that you'll touch
|> >|> in the next 100 instruction cycles. If you use it within 10,000
|> >|> instruction cycles, it will usually be good. (For virtual memories,
|> >|> if you use it within the next *billion* instruction cycles it will
|> >|> usually be good; compilers are very bad at dealing with patterns on
|> >|> that scale, but runtime systems can be very good at it.)
|> >
|> > Within the next 10,000 of your instruction cycles? Maybe. Within a
|> >billion? No way. Never.
|>
|> I'm not sure you realize how much faster CPU's are than disks---most
|> people don't, and it's kind of amazing when you think about it.

A LOT! But we solve most people's problem by making them buy enough RAM to hold
their dataset. To my mind, the more important ratios are secondary cache to
main memory, and primary cache to secondary cache. I think a good allocator
should worry not about making sure that the working set fits in main memory
but about making sure it fits in the caches. Related things should get allocated
on the same 4KB page. I think you might need more info than you can currently
get from the malloc interface for this.

|> On the other hand, the gap between CPU and disk speeds is widening;

The gap between the CPU and memory (whether cache or main memory) is also
widening at the same rate. So an on-chip cache miss is bad. Missing the secondary
cache is a disaster.

|> This implies that clustering is more attractive than true prefetching,
|> because if I can group together things that are touched at very roughly
|> the same time (e.g., within a few seconds of each other), I win. I
|> don't have to make very short-term predictions, and I don't have to
|> predict very precisely.

But not just clustering into main memory. You have to get the items to cluster
properly in the caches too. (I take it from your current interest in locality
that you're aware of this problem.)


|> > True. But the problem with runtime info is that you've already
|> >committed yourself before the patterns are apparent. I think local, static
|> >analysis can help here. True, it doesn't help with global issues. But
|> >once again, some knowledge is often better than none. (It's also often worse.)
|> >
|>
|> In principle, I think I agree. It's just that in practice, we should be
|> able to do a good job with the information at hand, once we understand
|> its significance. Later we can go looking for more information to get
|> further improvements. You have to be careful that gathering and
|> exploiting the information isn't more expensive than it's worth, which
|> is why I'm biased towards simple solutions--but based on sophisticated
|> knowledge about why they work. That's where the hard work comes in...

I totally agree.

Mike McDonald
mik...@engr.sgi.com

Thomas Wang

unread,
Jul 31, 1995, 3:00:00 AM7/31/95
to
Paul Wilson (wil...@cs.utexas.edu) wrote:

> The memory lost to fragmentation in conventional allocators appears to be
> widely overestimated---at least for the best allocators. This is partly
> because the allocator research area is such a mess that people use the
> wrong allocators (e.g., next fit) or don't tweak them in the right ways
> (e.g., getting rid of footer overhead for boundary tags).

> Our recent results show good allocators having only about 10% fragmentation.
> This is comparable to the losses you get just from a header word and rounding
> object sizes up to a word or double word for architectural reasons.

> (Some caveats are in order---this is for a set of eight varied C and C++
> programs, but there are lots of kinds of programs that aren't represented
> in *any* small set, and none of our programs runs for a very long time.)

As a contrasting data point, our commercial DBMS server initially had some
runaway fragmentation. We used a workaround of keeping an internal free
list that is never freed. Frequent use of megabyte-sized buffers
contributed to the memory fragmentation problem. You know, it is going to be
these large multi-million-dollar projects that are going to have this kind
of problem. On second thought, multimedia applications can run into
fragmentation problems as well.

> It's not clear to me that compaction is a win, on average. If objects
> are placed correctly in the first place, it may not be necessary and it
> may be *bad* for locality on average. The allocation order and sizes
> of objects are likely to provide very strong hints as to which objects
> will be used together. This info should be usable for decreasing
> fragmentation and for increasing locality.

I believe the most realistic control on fragmentation can be achieved by
giving the language implementor access to a virtual memory call for freeing
memory pages. Making large fragments not take up memory space is such an
obvious idea. This solution should knock out the dirty cache problem, as
well as being applicable to conservative collectors.
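As a sketch of what I mean, on a present-day Unix something like madvise()
can discard the whole pages inside a large free fragment; the exact call and
its semantics vary by OS, and the page size and rounding here are
illustrative:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>

  #define PAGE_BYTES 4096u

  /* Discard the whole pages inside a free block [start, start+len) so they
     stop occupying RAM (and never have to be paged out dirty). */
  void discard_free_pages(void *start, size_t len) {
      uintptr_t lo = ((uintptr_t)start + PAGE_BYTES - 1)
                     & ~(uintptr_t)(PAGE_BYTES - 1);
      uintptr_t hi = ((uintptr_t)start + len)
                     & ~(uintptr_t)(PAGE_BYTES - 1);
      if (hi > lo)
          madvise((void *)lo, (size_t)(hi - lo), MADV_DONTNEED);
  }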

A master's-thesis-quality paper could be done by implementing this scheme on AIX,
then doing some performance analysis.

Regards.


> Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wil...@cs.utexas.edu)


-Thomas Wang (Software Engineer, Enterprise Objects Program)
wa...@cup.hp.com http://hpeopose.cup.hp.com/~wang/wang.html


Paul Wilson

unread,
Aug 1, 1995, 3:00:00 AM8/1/95
to
In article <3vjb55$s...@fido.asd.sgi.com>,

Mike McDonald <mik...@engr.sgi.com> wrote:
>In article <3vcne8$e...@jive.cs.utexas.edu>, wil...@cs.utexas.edu (Paul Wilson) writes:
>
> In my experience, the system usually only comes with one [allocator]. If
>you want to try something different, you're on your own. That tends to lead
>people to build their own from scratch rather than experiment with using
>already proven ones. I think this is a pretty sad state of affairs.

Yes, it's pathetic. We (and Ben Zorn and Phong Vo) intend to fix this
sometime in the next year. We're trying to establish a standard format
for various kinds of allocator and program traces, so we can distribute
tracing, profiling, and simulation tools that will help people help
themselves. (For example, my group is working up a set of simple allocators
that are slow, but implement a variety of very different policies. This
will let people figure out what *policy* they want, and then find or build
a fast allocator that implements that policy.)
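For concreteness, a trace record might look something like this---just a
sketch, not the format we're actually proposing:

  #include <stdint.h>

  enum trace_op { TR_MALLOC, TR_FREE, TR_REALLOC };

  struct trace_record {
      uint8_t  op;         /* which allocator entry point was called */
      uint32_t size;       /* requested size (0 for free) */
      uint64_t obj_id;     /* identifies the object across its lifetime */
      uint64_t timestamp;  /* e.g., allocation count or instruction count */
  };

  /* A trace is just a sequence of these records; a simulator replays them
     against any allocator policy and reports fragmentation and locality. */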

> [ ... ]


>
> I'm a bit confused by one thing. In the beginning of the paper, you state
>that reducing fragmentation is the most important thing for an allocator to
>do. In this thread, I get the impression that you now think locality is
>more important. Care to elaborate?

We should have been clearer on this. Avoiding pathological fragmentation
is essential, but after that, locality is more important than a modest
amount of fragmentation. Also, all other things being equal, more
fragmentation costs you more locality.

The simple pathological case for (LIFO-ordered) next fit (first fit w/roving
pointer a la Knuth) illustrates this.

Suppose you have a loop that frees a very large short-lived object,
allocates a long-lived small object, and then allocates another short-lived
large object (always of the same size). You're in trouble in terms of
fragmentation *and* locality. In this case, every time you allocate a small
object, it takes a little chunk out of the big one you just freed. Then when
you allocate the next big one, that free block isn't quite big enough anymore,
so you have to come up with a whole new big block, e.g., by asking the OS
for more memory.

Consider the case where the large objects are 1MB pixelmaps, used briefly
to store a frame of graphics as it's generated, just long enough to generate
and display an image, and the small ones are something else that you save
for a long time, say, 16 bytes each. In this case, each of the little
objects ends up costing you about 1 MB, because it wrecks a 1 MB
free space even though you're only using 16 bytes.
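In code, the pathological loop looks something like this; whether you
actually see the blowup depends on the allocator's policy, of course:

  #include <stdlib.h>

  #define FRAME_BYTES (1024 * 1024)
  #define SMALL_BYTES 16
  #define FRAMES      100

  int main(void) {
      void *saved[FRAMES];
      for (int i = 0; i < FRAMES; i++) {
          void *frame = malloc(FRAME_BYTES);  /* short-lived pixelmap */
          /* ... generate and display the frame ... */
          free(frame);
          saved[i] = malloc(SMALL_BYTES);     /* long-lived 16-byte object:
                                                 under next fit, carved out
                                                 of the block just freed */
      }
      /* Under next fit, each new 1MB request no longer fits in the nicked
         block, so the allocator keeps asking the OS for fresh memory. */
      for (int i = 0; i < FRAMES; i++)
          free(saved[i]);
      return 0;
  }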

You're very likely to simply use so much memory that the program fails
from lack of swap space. It's also likely to run fairly slowly, on
a fast machine, because you have to page out all of the wasted space---it's
dirty because a short-lived graphic image was briefly stored there.
So in this extreme case, you lose in both overall memory usage and in
locality.

Even if the program stabilizes after a while and stops doing this, so
that you don't exhaust swap space and crash, you're likely to lose big
in locality. You end up with the small objects scattered around, so
that most of them are isolated on a page with a hunk of a free block.
If you keep touching the little objects, they'll keep costing you a page
of RAM each. For 4KB pages, this means that they take up 256 times as
much RAM as is actually needed, which is quite a lot. (And this locality
cost is likely to come up in many more cases, because this applies whenever
the large objects are bigger than a page, and to lesser degrees with
smaller sizes.)

As Thomas Wang and Henry Baker have pointed out, there are memory hierarchy
tweaks you can do to avoid the lost memory, by essentially telling the
VM, or a fancy cache, that it doesn't need to store the bits in the
freed objects. (This is similar to something I proposed around 1989
in a SIGPLAN Notices paper.) This is a good thing to be able to do,
all other things being equal, but for a variety of practical reasons
(e.g., whether OS and hardware fixes are cheap enough in speed and
hardware costs), it's better to avoid resorting to those things in cases
where there's a simple software fix---e.g., avoid using next fit.

I think that in most cases, relying on an OS or hardware fix will not
be necessary to get excellent performance. It's unclear what fraction
of the time you won't be able to do this, and how justified (and practical)
fancier schemes are. We need to sort out the basic issues of fragmentation
and locality first.

>|> Yes, this is easy. I just think it's unnecessary, because successive
>|> allocations of the same size usually are allocations of the same type
>|> of object, even if other phases of the program allocate different objects
>|> of the same sizes. And if you're going to discriminate further, you may
>|> want to use call site (or call chain) info instead.
>
> But two successive calls of different sizes may also be of the same type.
>A call to allocate one foo looks quite different to malloc than a call
>to allocate five foos. But the types are the same, with, I'd guess, similar
>death times.

My guess is that this *isn't* a good guess, because the relevant *purposes*
of the objects aren't reflected in the types in this case. Suppose, for
example, you have a bunch of complex numbers represented as structs.
If they're allocated on the heap, that says something different about
likely access patterns than if they're allocated in a big array. This
seems to me to be one of the cases where the relevant information may be
available to the runtime system but not the compiler. I'd likely be
wrong in the case of variable-sized arrays. (GCC, for example, seems
to allocate certain arrays for a certain purpose at a certain place
in its characteristic phases, but the arrays are a different size for
each occurrence of this kind of phase.) In this case, I think call
site info is more appropriate than type info (because it's specific to
the kind of phase). My hunch is that the call site info won't actually
provide a big win in most cases, however. (For example, for GCC, the
way those arrays are used turns out not to be a problem for a good
allocator, but GCC may not be representative.)

>|> Also, in my understanding, it's usually not the compiler that does the
>|> rounding (although the compiler does padding)---doubleword alignment is
>|> usually enforced by the allocator; at least some of the information is
>|> usually passed along through the allocator interface, although padding
>|> may filter out some of it.
>
> Who requires the alignment varies. The allocator may enforce a larger
>alignment policy in order to make its job easier.

Right. But in this case, the allocator-writer has a choice. For example,
Doug Lea's allocator uses 8-byte alignment, when 4-byte could be used
on many platforms. I think it's worth customizing the allocator to
use a smaller alignment unit. It makes compiling the allocator very
slightly more difficult, but it's well worth it. For good allocator
designs, I don't think it'll cost you much in terms of runtime efficiency.

> Ah, the MinMax philosophy. Don't try to win, just try to keep from losing.
>(And hope the other guy goofs up.) This philosophy never has appealed to me.
>I do admit though that it usually does work.

I think that it's particularly appropriate in this domain, since the
"adversaries" aren't really adversaries, and since the granularity issues
(discussed two postings back) are favorable.

>|> >|> Clustering is where it's at, in my view---you want the right data
>|> >|> grouped together so that a dumb fetch of a block or page will grab
>|> >|> a lot of useful data. It doesn't have to be data that you'll touch
>|> >|> in the next 100 instruction cycles. If you use it within 10,000
>|> >|> instruction cycles, it will usually be good. (For virtual memories,
>|> >|> if you use it within the next *billion* instruction cycles it will
>|> >|> usually be good; compilers are very bad at dealing with patterns on
>|> >|> that scale, but runtime systems can be very good at it.)
>|> >
>|> > Within the next 10,000 of your instruction cycles? Maybe. Within a
>|> >billion? No way. Never.
>|>
>|> I'm not sure you realize how much faster CPU's are than disks---most
>|> people don't, and it's kind of amazing when you think about it.
>
> A LOT! But we solve most people's problem by making them buy enough RAM to
>hold their dataset.

I think this is true for many people, but not the vast majority. Most
computers out there are PC's without tons of RAM---and this promises to
be even more true in the next few years, because sales of PC's are taking
off, and RAM production may not be able to keep up.

On the other hand, cache misses are increasingly important, too, as CPU
speeds go up amazingly quickly. So all of this is increasingly important.

>To my mind, the more important ratios are secondary cache to main memory,
>and primary cache to secondary cache. I think a good allocator should worry
>not about making sure that the working set fits in main memory
>but about making sure it fits in the caches. Related things should get
>allocated on the same 4KB page.

This is tricky. Block sizes (and prefetch ranges) are usually much smaller
than 4KB---e.g., 32 bytes or so. This will have to go up through increasing
block sizes or aggressive prefetching, to mask increasing latencies.

>I think you might need more info than you can currently get from the malloc
>interface for this.

I think this may be true in the long run, but in the short-to-medium run
I think we can get big wins by exploiting short object lifetimes and
phase structure. We at least need to sort those issues out before we
know whether there's a problem, or what kind of problem.

>|> On the other hand, the gap between CPU and disk speeds is widening;
>
> The gap between the CPU and memory (whether cache or main memory) is also
>widening at the same rate. So an on-chip cache miss is bad. Missing the secondary
>cache is a disaster.

It's bad, but it's not as big a disaster as paging. Poor cache locality
may cost you a factor of two or ten or more in extreme cases, but paging
is more likely to cost you something orders of magnitude worse.

>|> This implies that clustering is more attractive than true prefetching,
>|> because if I can group together things that are touched at very roughly
>|> the same time (e.g., within a few seconds of each other), I win. I
>|> don't have to make very short-term predictions, and I don't have to
>|> predict very precisely.
>
> But not just clustering into main memory. You have to get the items to
>cluster properly in the caches too. (I take it from your current interest
>in locality that you're aware of this problem.)

Yes. I just think we'll be able to do okay with our current approach.
For example, avoiding excessive header/footer overhead and internal
fragmentation avoids polluting the cache, and clustering related objects
gives spatial locality at all scales, all other things being equal.

Henry Baker

unread,
Aug 1, 1995, 3:00:00 AM8/1/95
to
In article <justin-0108...@158.234.26.212>, jus...@logcam.co.uk
(Justin Forder) wrote:

> In article <hbaker-3107...@192.0.2.1>, hba...@netcom.com (Henry
> Baker) wrote:
>
> [snip]


>
> > My own pet peeve is _weak_ language type systems like Ada and C++ that
> > _force_ you to break apart a single object into a number of smaller
> > subobjects _that are separately mallocated_ even when you know how to
> > manage the object as one object in assembly language.

> I'm puzzled - in C++, using a pointer is the price you pay for polymorphism.
>
> The design space for C++ classes has two "safe corners":
>
> 1) for user-defined data types, with value semantics and no inheritance
> or virtual member functions, and with the relevant operators defined
> to make them well-behaved C++ data types,
>
> 2) the other for object-oriented programming, with inheritance, virtual
> member functions and destructors (and possibly protection
> against copying or assignment).
>
> These are Coplien's two "canonical class forms". Objects of the first kind
> (values) normally live on the stack or are directly embedded in larger
> objects.
> Objects of the second kind (instances) normally live on the heap, and give
> rise to all the memory management problems.
>
> Things get more interesting when you move into the middle of the design space,
> or produce hybrid approaches (e.g. for large values with copy-on-write, or
> as with the many kinds of smart pointers).
>
> You are right to say that C++ throws a large problem over to application
> writers, but by doing this it does allow a great variety of solutions, and
> once you have chosen a solution it is often possible to package it in a way
> that can be used easily and concisely.

All you've said, is that if you go along with Stroustrup/Coplien's
programming styles, you can get the job done, sort of. However, I would
argue that following their advice can cost you 3X - 100X in performance,
because you drive the poor allocator/deallocator crazy.

Forcing a program to go through two indirections to access array/string
elements, because C++ has no way to store the length of the array/string along
with the array/string itself, is probably the most irritatingly obvious
indication of the problem, but the problem is a good deal more widespread than
this.

Stroustrup/Coplien argue that you now need to 'solve' this problem by using
type-specific allocators. Well, if C++ allowed you to do it correctly in
the first place, you would have no (or certainly much less) need for this
sort of hacking.

Justin Forder

unread,
Aug 1, 1995, 3:00:00 AM8/1/95
to
In article <hbaker-3107...@192.0.2.1>, hba...@netcom.com (Henry
Baker) wrote:

[snip]

> My own pet peeve is _weak_ language type systems like Ada and C++ that
> _force_ you to break apart a single object into a number of smaller subobjects
> _that are separately mallocated_ even when you know how to manage the

> object as one object in assembly language. Stroustrup's C++ is especially
> 'pointer-happy', as a perusal of the following two books will show:
>
> Stroustrup, Bjarne. "The C++ Programming Language, Second Edition". Addison
> Wesley, 1991.
>
> Ellis, M.A. and Stroustrup, B. "The Annotated C++ Reference Manual".
> Addison Wesley, 1990.
>
> Whenever things get tough, C++ wimps out and forces another pointer
indirection.
>
> Of course, this 'solves' the problem for the C++ implementor, but throws
> a much larger problem over the fence onto the poor application writer
> and/or memory manager, who is now forced to deal with a much higher load
> of smaller objects.
>
> Whoever said that C++ doesn't force you to pay for features you don't use
> hasn't ever programmed in C++ to any extent.

I'm puzzled - in C++, using a pointer is the price you pay for polymorphism.

The design space for C++ classes has two "safe corners":

1) for user-defined data types, with value semantics and no inheritance
or virtual member functions, and with the relevant operators defined
to make them well-behaved C++ data types,

2) the other for object-oriented programming, with inheritance, virtual
member functions and destructors (and possibly protection
against copying or assignment).

These are Coplien's two "canonical class forms". Objects of the first kind
(values) normally live on the stack or are directly embedded in larger objects.
Objects of the second kind (instances) normally live on the heap, and give
rise to all the memory management problems.

Things get more interesting when you move into the middle of the design space,
or produce hybrid approaches (e.g. for large values with copy-on-write, or
as with the many kinds of smart pointers).

You are right to say that C++ throws a large problem over to application
writers, but by doing this it does allow a great variety of solutions, and
once you have chosen a solution it is often possible to package it in a way
that can be used easily and concisely.

Justin

--
Justin Forder jus...@logcam.co.uk
Logica, Stephenson House, 75 Hampstead Road, London NW1 2PL, UK

Scott Wheeler

unread,
Aug 1, 1995, 3:00:00 AM8/1/95
to

In Article <hbaker-0108...@192.0.2.1> Henry Baker writes:
>All you've said, is that if you go along with Stroustrup/Coplien's
>programming styles, you can get the job done, sort of. However, I
>would argue that following their advice can cost you 3X - 100X in
>performance, because you drive the poor allocator/deallocator crazy.
>Forcing a program to go 2 indirections to access array/string elements
>because C++ has no way to store the length of the array/string along
>with the array/string is probably the most irritatingly obvious
indication of the problem, but the problem is a good deal more
widespread than this.

Surely you are not claiming that a Pascal/BASIC-type string arrangement
in memory is going to run at least 3x faster than a typical C++ string?
This strains credibility.

Scott

Stefan Monnier

unread,
Aug 2, 1995, 3:00:00 AM8/2/95
to
In article <hbaker-3107...@192.0.2.1>,
Henry Baker <hba...@netcom.com> wrote:
] IMHO the major gain from compaction is to defragment the address space,

] but as Paul's recent work has shown, one may not have to do this very
] often.

I thought the main advantage of compaction was that allocation could then be a
simple stack push. The rest is just nice side-effects !

] shows how a gc makes _excellent_ use out of a write-allocate cache, and


] that short-lived objects cause no problems at all, because _they live
] and die in the cache and need never cause any memory traffic at all_.

*BUT* in cases like the alpha processor where the 1st level cache is 8Kb in size
it means a tiny GC has to be done very often, unless you don't even try to deal
with 1st-level cache locality. Having the first generation fit in 128K (or maybe
even 32K) might be OK, but I'm not sure 8K is reasonably doable (this is just a
wild guess, tho, and it probably depends a lot on your allocation speed (why do
those five letters "SMLNJ" spring to my mind ? beats me!)).

] Manual allocators have exactly the same trouble that GC's do if they


] have no way to tell the HW that a particular cache line is now dead.

Except that manual allocators can more easily reuse that dead space, so that
they don't need to tell the HW that it's dead.

] > Stefan Monnier (hey, that's me !) felt like saying:
] > >destructive read (can be used for linear-programming, or for stack-pops)).


] ^^^^^^^^^^^^^^^^^^
] I think you mean 'linear logic style of programming' here!

Right, I definitely meant 'linear logic style of programming'.


Stefan

Paul Wilson

unread,
Aug 2, 1995, 3:00:00 AM8/2/95
to
In article <3vnghm$s...@info.epfl.ch>,

Stefan Monnier <stefan....@epfl.ch> wrote:
>In article <hbaker-3107...@192.0.2.1>,
>Henry Baker <hba...@netcom.com> wrote:
>] IMHO the major gain from compaction is to defragment the address space,

>] but as Paul's recent work has shown, one may not have to do this very
>] often.

Right. For "most" programs (that have been studied in enough detail, that
is), it looks like you never have to do it at all. Nobody's studied
really long-running programs, though, and it's not clear whether
fragmentation will accumulate over time, or stabilize at an acceptable
level.

>I thought the main advantage of compaction was that allocation could then be a
>simple stack push. The rest is just nice side-effects !

This advantage is wildly overrated, for two reasons. First, if
you're not allocating tons of stuff on the heap when you don't need to,
shaving a very few instructions off of the allocation or deallocation
time isn't very important. GC usually loses in deallocation costs
and/or locality more than it gains in screamingly fast allocation.

The second is that a conventional allocator can support very fast
allocation. It's only very slightly more expensive to allocate from
a free list than to bump a pointer. It *is* just bumping a pointer,
to a first approximation. Most of the "costs" of allocation in the
usual case come from how conventional allocators are linked, not from
the allocator itself. If you have the size info at compile time, and
you inline the usual-case code for the allocator, most of the difference
goes away. See our allocator survey (available from the ftp site
in my .sig, below) for a more detailed explanation.
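
For concreteness, a minimal sketch of that kind of inlined fast path:
per-size-class free lists, with the size class known at compile time.
The names (fast_alloc, slow_alloc, and so on) are invented for
illustration, not taken from any particular allocator:

#include <stddef.h>
#include <stdlib.h>

struct FreeCell { FreeCell *next; };

static FreeCell *free_lists[32];          /* one list head per size class */

static void *slow_alloc(int size_class, size_t size)
{
    (void) size_class;
    return malloc(size);    /* stand-in: a real allocator refills the list here */
}

static inline void *fast_alloc(int size_class, size_t size)
{
    FreeCell *cell = free_lists[size_class];
    if (cell) {                               /* usual case: pop the list head */
        free_lists[size_class] = cell->next;
        return cell;
    }
    return slow_alloc(size_class, size);      /* rare case: refill */
}

static inline void fast_free(int size_class, void *p)
{
    FreeCell *cell = (FreeCell *) p;          /* freeing is also just a push */
    cell->next = free_lists[size_class];
    free_lists[size_class] = cell;
}

With the size class a compile-time constant and the calls inlined, the
usual case is a handful of instructions, which is the point being made
above.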

(Again, a caveat: I'm a huge fan of GC, and this should not be taken
too strongly. A well-implemented GC is a great thing. I'm just countering
some common hyperbole. Also, the indirect effects of having to deallocate
things explicitly probably often outweigh the direct performance
benefits---more on that later.)

>] shows how a gc makes _excellent_ use out of a write-allocate cache, and


>] that short-lived objects cause no problems at all, because _they live
>] and die in the cache and need never cause any memory traffic at all_.

>*BUT* in cases like the alpha processor where the 1st level cache is 8Kb in

>size it means a tiny GC has to be done very often, unless you don't even
>try to deal with 1st-level cache locality. Having the first generation fit
>in 128K (or maybe even 32K) might be OK, but I'm not sure 8K is reasonably
>doable (this is just a wild guess, tho, and it probably depends a lot on

>your allocation speed (why do those five letters "SMLNJ" spring to my
>mind ? beats me!)).

This is quite right. If you make the youngest generation small enough to
fit in a very small, fast cache, you lose on GC efficiency---you end
up tracing (marking or copying) more data, because you don't give things
as much time to die.

If you *don't* do this, you can't reuse space as quickly, so you miss
your cache allocating things in memory that isn't still in the cache.

In the general case, this isn't *just* a bandwidth cost, because you're
*polluting* the cache with garbage. The fact that you're keeping dead
objects around untouched generally means that you're wasting cache
space. If you can repeatedly put several short-lived things in the
same chunk of memory (by reusing space immediately), you *should* win
relative to allocating them in different spaces, because you'll force
less *other* stuff out of the cache.

Surprisingly, the data for nearly-functional languages don't seem to
bear this out. I suspect that there are just too many other variables
involved in the way the tested systems are built, because this *has*
to be true in general. And in principle, whatever else those systems
are doing right could probably be done in an explicitly managed heap
as well. (But not necessarily in practice.)

>] Manual allocators have exactly the same trouble that GC's do if they


>] have no way to tell the HW that a particular cache line is now dead.
>

>Except that manual allocators can more easily reuse that dead space, so that

>they don't need to tell the HW that it's dead.

Exactly. In fact, a good normal allocator and a normal cache do exactly
this already---the allocator tells the hardware which bits no longer have to
be stored, by clobbering those bits with new bits---i.e., overwriting them
with new data when a new object is allocated there.

An explicit deallocator can do a lot of the same tricks as a GC's allocator.
For example, if it turned out to be an advantage to wander through lots
of memory without reusing it soon (which I'll bet it's not), and rely on
a write-allocate cache to avoid fetching data at each miss, an explicit
allocator can easily do that, too. It doesn't *have* to reuse memory
immediately. An explicit allocator always has the edge, in a sense,
because it simply has more information---it knows sooner when objects
die, because you tell it. This assumes that you tell it right, though.

Which brings us to the really sticky stuff. There are things no allocator
can control, GC'd or not, namely how programs are written. Allocators and
GC's affect programming style, however.

If you have a GC, you're likely to use a lot of shared data structures and
let the GC sort out when to reclaim things. If you don't have a GC, you're
likely to structure your program around that fact (even if you don't know
that's what you're doing, and don't know what you're missing by not having
a GC).

In non-GC'd languages, it's common to do tacky and inefficient things to
make it easy (or possible, in practical terms) to reclaim space without
long-term leaks. One very common strategy is to copy data structures
so that several modules can have their own copies, and each can free its
copy when it's through with it. This is often much easier to do than
to figure out when objects are no longer interesting to *any* module.
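
A small sketch of that copy-per-module pattern, with invented names;
each consumer takes its own copy so it can free on its own schedule, at
the cost of extra copying and lost sharing:

#include <stdlib.h>
#include <string.h>

void module_a_use(const char *item)
{
    char *copy = (char *) malloc(strlen(item) + 1);   /* defensive copy */
    if (copy == 0)
        return;
    strcpy(copy, item);
    /* ... work with copy ... */
    free(copy);       /* safe, because no other module ever sees this copy */
    /* With a GC (or real shared ownership), the module could simply keep
       a pointer to the one shared object instead. */
}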

This is a functional programming trick, really---you lose object identity
because things that are (conceptually) individual objects are passed around
as values rather than referred to via pointers. The big problem is that
it's a functional programming style implemented very BADLY. In a real
functional programming language (or nearly functional programming language),
with GC, the language implementation can do sharing automatically, and avoid
most of the copying. (In a nearly functional language or an OO language with
GC, the implementation can support object identity cleanly as well.)

The end result is that hands-on memory allocation may cost you more
performance than it gets you, in practice. It leads to an inefficient
programming style, which is less efficient than biting the GC cost bullet
and programming the right way.

The software engineering costs are more obvious---this kind of distortion
of programming styles is one of the big reasons there's not a lot of clean,
maintainable, extensible software. My belief is that in the long run,
GC'd programs are usually faster as well as cleaner, because given finite
amounts of programmer effort, GC allows you to focus on more important
performance issues, like plugging in different algorithms (for real problem
solving, not storage management) and getting direct performance wins.

This is, unfortunately, very difficult to prove, because it requires
controlling lots of variables and developing several versions of some
large software.

Dennis O'Connor -FT-~

unread,
Aug 2, 1995, 3:00:00 AM8/2/95
to

Scott Wheeler <sco...@bmtech.demon.co.uk> writes:
] Surely you are not claiming that a Pascal/BASIC-type string arrangement
] in memory is going to run at least 3x faster than a typical C++ string?
] This strains credibility.

"strlen" sure would. :-)
--
Dennis O'Connor doco...@sedona.intel.com
i960(R) Architecture and Core Design Not an Intel spokesman.
TIP#518 Fear is the enemy.

Henry Baker

unread,
Aug 2, 1995, 3:00:00 AM8/2/95
to
In article <jyu...@bmtech.demon.co.uk>, Scott Wheeler
<sco...@bmtech.demon.co.uk> wrote:

> In Article <hbaker-0108...@192.0.2.1> Henry Baker writes:
> >All you've said, is that if you go along with Stroustrup/Coplien's
> >programming styles, you can get the job done, sort of. However, I
> >would argue that following their advice can cost you 3X - 100X in
> >performance, because you drive the poor allocator/deallocator crazy.
> >Forcing a program to go 2 indirections to access array/string elements
> >because C++ has no way to store the length of the array/string along
> >with the array/string is probably the most irritatingly obvious
> indication of the problem, but the problem is a good deal more
> widespread than this.
>

> Surely you are not claiming that a Pascal/BASIC-type string arrangement
> in memory is going to run at least 3x faster than a typical C++ string?
> This strains credibility.

If you allocate/deallocate these things very often, 3x may be conservative.
You pay some overhead simply accessing the string on a machine with a
longer cache line, because the string header will cause a cache fault,
and proceed to bring in a lot of stuff that won't be used, followed by
an indirection which causes another cache fault when you start gaining
access to the characters themselves.

The C++ beginner is sucked in because 1) strings are 0x00 terminated,
and therefore don't usually require that the programmer keep separate
track of a length, and 2) the actual length of the storage is kept in
a 'hidden' location managed by the allocator itself. Thus, you get
another example of hidden overhead that you can't even access yourself
-- e.g., to get a rough idea of how big the string is before searching
for the 0x00.

If you have to dynamically allocate arrays of kinds other than C-style
strings, you now have to explicitly deal with a length, and you probably
want to check indices for legality, so an explicit length field becomes
a necessity.

Basic is slow because of interpreter overhead, not because of string
allocation overhead.

Just read the various C++ forums (fora?) for a while, and you keep getting
this feeling of deja vu all over again, as everyone describes their favorite
'specialized allocator' hack of the week for how to deal with allocator
overhead.

Paul Flinders

unread,
Aug 3, 1995, 3:00:00 AM8/3/95
to

In article <3vlgie$h...@jive.cs.utexas.edu> wil...@cs.utexas.edu (Paul Wilson) writes:


> Consider the case where the large objects are 1MB pixelmaps, used briefly
> to store a frame of graphics as it's generated, just long enough to generate
> and display an image, and the small ones are something else that you save
> for a long time, say, 16 bytes each. In this case, each of the little
> objects end up costing you about 1 MB each, because they wreck a 1 MB
> free space even though you're only using 16 bytes.
>
> You're very likely to simply use so much memory that the program fails
> from lack of swap space. It's also likely to run fairly slowly, on
> a fast machine, because you have to page out all of the wasted space---it's
> dirty because a short-lived graphic image was briefly stored there.
> So in this extreme case, you lose in both overall memory usage and in
> locality.


I've experienced this - some time ago I was investigating a
triangulation routine which ran substantially faster the first time
through (5-10x). It was allocating large numbers (> 1million) of small
structures for the triangle mesh - at first we thought that malloc was
allocating objects scattered through memory once the program had run
for a while but investigation revealed that was not the case. Rather
the first time through the OS gave us virgin zero-filled pages, the
second time through, as we hit the pages (which had been paged out in
the meantime) they had to be paged back in.

To my mind the two examples that Paul gave and the one above
illustrate that we should not place too much reliance on a single
general purpose allocator - especially if we can use information about
the likely behaviour of the objects in it. In the first case separate
allocators for large and small objects would reduce or eliminate the
fragmentation. In my example (allocation of a collection of large
numbers of small objects which have a relatively short life)
performance was vastly improved by writing a custom allocator (Ok, Ok,
yes I *know* ...) which managed a heap in a "mmaped" region - when the
mesh was finished with it was returned to the OS (thus eliminating
both the redundant page out and page in).
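
For illustration, a rough sketch of that sort of throwaway arena,
assuming a POSIX-style mmap/munmap (spelled MAP_ANONYMOUS here; older
systems use MAP_ANON or map /dev/zero), with alignment and error
handling kept minimal and all names invented:

#include <stddef.h>
#include <sys/mman.h>

struct Arena { char *base; size_t size; size_t used; };

int arena_init(Arena *a, size_t size)
{
    void *p = mmap(0, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = (char *) p; a->size = size; a->used = 0;
    return 0;
}

void *arena_alloc(Arena *a, size_t n)
{
    n = (n + 7) & ~(size_t) 7;              /* keep 8-byte alignment */
    if (a->used + n > a->size)
        return 0;                           /* arena full */
    void *p = a->base + a->used;
    a->used += n;                           /* allocation is a pointer bump */
    return p;
}

void arena_destroy(Arena *a)                /* mesh finished: give pages back */
{
    munmap(a->base, a->size);
}

Freeing the whole mesh is then one munmap, so the dead pages never have
to be written out or faulted back in.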

With current OSes (ok I'm thinking mostly of Unix) we probably need
three allocators

1) General purpose for small-medium size short-long lived objects
2) A large object allocator
3) An allocator for large collections of short-lived objects.

Separating large and small objects should reduce fragmentation, being
able to signal to the memory management system that we're no longer
interested in the data a section of memory contains helps reduce
unwanted write-backs and re-loads (both for paged systems and at a
finer level for caches).

TTFN

Paul Flinders
(P.T.Fl...@cs.bham.ac.uk or pfli...@farnboro02.datasci.co.uk)

Scott Wheeler

unread,
Aug 3, 1995, 3:00:00 AM8/3/95
to

I had assumed from your "2 indirections" that you were talking about
C++ strings, not C strings - the former are usually implemented with
length stored explicitly, not relying on null termination. Data
representation is something like

class str {
    int iCount;
    char *pachData;
};

Of course this can mean that the length count and character array
pointer are stored separately from the array itself, which was what I
thought your complaint was. My reference to Pascal and BASIC was
because they are usually implemented similarly but with the count and
the character array stored contiguously, avoiding one level of
indirection and only allocating one block of memory. Data
representation is often something like

class str {
    int iCount;
    char achData[256];
};

with the obvious problems in terms of fixed maximum length of the
string.

I am afraid I am now unclear as to what you are saying can be more
than 3x faster than the C++ means of handling strings in respect of
allocate/deallocate. Can you sketch out your preferred solution, and
perhaps we can work out some way of benchmarking it?

Scott

dsie...@icaen.uiowa.edu

unread,
Aug 3, 1995, 3:00:00 AM8/3/95
to
p...@fat-controller.cs.bham.ac.uk (Paul Flinders) writes:


>I've experienced this - some time ago I was investigating a
>triangulation routine which ran substantially faster the first time
>through (5-10x). It was allocating large numbers (> 1million) of small
>structures for the triangle mesh - at first we thought that malloc was
>allocating objects scattered through memory once the program had run
>for a while but investigation revealed that was not the case. Rather
>the first time through the OS gave us virgin zero-filled pages, the
>second time through, as we hit the pages (which had been paged out in
>the meantime) they had to be paged back in.


GNU's mmalloc package uses mmap rather than sbrk to do its allocation. So
free truly frees the pages, returning them to the OS. Having that sort of
behavior would be good for a lot of cases (including yours) where the generic
sbrk-based malloc doesn't cut it. Some sbrk-based mallocs will return
memory to the OS, but obviously only if you free memory at the tail end of
the DATA segment: if you malloc 100 megs of memory that you want to free and
then malloc 1 more byte, freeing that 100 megs won't return the memory to the
OS with malloc, but it will with mmalloc.

--
Doug Siebert || "Usenet is essentially Letters to the Editor
University of Iowa || without the editor. Editors don't appreciate
dsie...@icaen.uiowa.edu || this, for some reason." -- Larry Wall

Mike McDonald

unread,
Aug 3, 1995, 3:00:00 AM8/3/95
to
In article <jyv...@bmtech.demon.co.uk>, Scott Wheeler <sco...@bmtech.demon.co.uk> writes:


|> class str {
|> int iCount;
|> char achData[256];
|> };
|>
|> with the obvious problems in terms of fixed maximum length of the
|> string.

Yuch! In C (I don't know how to do it in C++), you'd do something like:

typedef struct string {
    int length;
    char contents[0];    /* the characters follow the length in the same block */
} *String;

String make_string(int n)
{
    String s;

    s = (String) calloc(1, sizeof(struct string) + n);
    if (s == NULL)
        you_lose();
    s->length = n;
    return (s);
}


(If your C compiler doesn't like arrays of length zero, declare contents as
length 1 and subtract 1 from the sizeof(struct string).)
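
For example, a sketch of that "length 1" variant using offsetof, so the
size arithmetic doesn't depend on trailing padding; the names are
changed slightly (string1, make_string2) only to keep them distinct
from the version above:

#include <stddef.h>    /* offsetof */
#include <stdlib.h>    /* calloc */

typedef struct string1 {
    int length;
    char contents[1];            /* really 'length' bytes, allocated below */
} *String1;

String1 make_string2(int n)
{
    String1 s = (String1) calloc(1, offsetof(struct string1, contents) + n);
    if (s != NULL)
        s->length = n;
    return s;
}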

This way, both the length and the data are allocated together in the heap. And
you haven't limited the max length artificially. (I suspect most BASIC systems do
this.)

|> I am afraid I am now unclear as to what you are saying can be more
|> than 3x faster than the C++ means of handling strings in respect of
|> allocate/deallocate. Can you sketch out your preferred solution, and
|> perhaps we can work out some way of benchmarking it?
|>
|> Scott

A page fault or cache miss between accessing the length and the contents costs
one heck of a lot! I wouldn't be surprised if, with a lot of allocators, the
miss happens often enough to cause the 3x difference in speed overall.

Mike McDonald
mik...@engr.sgi.com

Cyber Surfer

unread,
Aug 3, 1995, 3:00:00 AM8/3/95
to
In article <jyv...@bmtech.demon.co.uk>
sco...@bmtech.demon.co.uk "Scott Wheeler" writes:

> class str {
> int iCount;
> char achData[256];
> };
>
> with the obvious problems in terms of fixed maximum length of the
> string.

It could be written as:

class str {
    int iCount;
    char achData[1];
};

Then you can allocate the extra memory. C++, like C, doesn't stop
you from doing this.
--
<URL:http://www.demon.co.uk/community/index.html>
<URL:http://www.enrapture.com/cybes/namaste.html>
"You can never browse enough."

Paul Flinders

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to

Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:

> It could be written as:
>
> class str {
>     int iCount;
>     char achData[1];
> };
>
> Then you can allocate the extra memory. C++, like C, doesn't stop
> you from doing this.

Ok, you *could* do this by overloading operator new - but God help you
if str has virtual functions as the virtual function table pointer
will be in the middle of your achData array.
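
For what it's worth, a sketch of what that overloaded operator new
might look like, assuming str stays a plain class with no virtual
functions and no base classes, and with allocation-failure handling
omitted:

#include <stdlib.h>    /* malloc, free */
#include <stddef.h>    /* size_t */

class str {
public:
    int  iCount;
    char achData[1];                       /* really iCount chars (the trick
                                              under discussion; not strictly
                                              sanctioned by the type system) */

    void *operator new(size_t base, int chars)
    {
        return malloc(base - 1 + chars);   /* header and characters in one block */
    }
    void operator delete(void *p) { free(p); }
};

/* usage (hypothetical):
       str *s = new (20) str;
       s->iCount = 20;                                                        */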

Markus Freericks

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <807474...@wildcard.demon.co.uk> Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:
> It could be written as:
>
> class str {
> int iCount;
> char achData[1];
> };
>
> Then you can allocate the extra memory. C++, like C, doesn't stop
> you from doing this.

No, but it's wrong, wrong, wrong. Imagine a compiler that uses different
pointer types for differently-sized objects, e.g. a PC compiler.

str* foo=malloc(sizeof(str)+7000)

will give you a near pointer, while

str* foo=malloc(sizeof(str)+70000)

will give you some kind of far pointer, and likely break your program.

The Correct way is

class str {
    int iCount;
    char achData[MAXDATA]; /* whatever maximum size you want */
};

#define alloc_str_of_size(N) (str*)malloc(sizeof(str)-MAXDATA+N)

-- Markus

Cyber Surfer

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <PTF.95Au...@fat-controller.cs.bham.ac.uk>
p...@fat-controller.cs.bham.ac.uk "Paul Flinders" writes:

> Ok, you *could* do this by overloading operator new - but God help you
> if str has virtual functions as the virtual function table pointer
> will be in the middle of your achData array.

That's why I only use that technique in C, or a subset of C++. As far
as I'm aware, you can still use malloc/free to allocate C++ objects,
and you don't have to use virtual destructors. Your choices may be
restricted, compared with some other languages, but it's still possible
to do it. I was simply pointing out another way that it might be done.

<URL:ftp://ftp.demon.co.uk/pub/rfc/rfc1738.txt>

Cyber Surfer

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <3vssnd$h...@news.cs.tu-berlin.de>
m...@cs.tu-berlin.de "Markus Freericks" writes:

> No, but its wrong, wrong, wrong. Imagine a compiler that uses different
> pointer types for differently-sized objects, e.g. a PC compiler.
>
> str* foo=malloc(sizeof(str)+7000)
>
> will give you a near pointer, while
>
> str* foo=malloc(sizeof(str)+70000)
>
> will give you some kind of far pointer, and likely break your program.

That's why I never use sizeof to find the length of a string. I _know_
when a C/C++ string object is the size the compiler thinks it is, and
when it is not.

It may not be clean, but it works.



> The Correct way is
>
> class str {
> int iCount;
> char achData[MAXDATA]; /* whatever maximum size you want */
> };
>
> #define alloc_str_of_size(N) (str*)malloc(sizeof(str)-MAXDATA+N)

So you still have to know when the true size of the string will be
different from what the compiler thinks it is. I don't think this is any more
"correct" than any other version we've seen here. The differences
are of style, and a matter of how much work a programmer is prepared
to do for the compiler.

I prefer to just let a compiler handle all this, but as far as I'm
aware, there's still no "standard" C++ string class. Please correct
me if I'm wrong, as it would be news to me.

Perhaps this is just as well, as we might end up with one that we're
not happy with. At present, if we're lucky enough, we can choose how
to design a string class. It's generally a choice that has been made
for me, if I'm using a C++ framework.

Scott Wheeler

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In Article <3vr85r$7...@fido.asd.sgi.com> Mike McDonald writes:
>In article <jyv...@bmtech.demon.co.uk>, Scott Wheeler
<sco...@bmtech.demon.co.uk> writes:
>
>
>|> class str {
>|> int iCount;

>|> char achData[256];
>|> };
>|>
>|> with the obvious problems in terms of fixed maximum length of the
>|> string.

BTW, note that I wasn't recommending this design, just providing it for
illustration.

>
> Yuch! In C, (I don't know how to do it in C++) you'd do something
like:
>
>typedef struct string {
> int length;
> char contents[0];
> } *String;
>
>String make_string(int n)
>{
> String s;
>
> s = (String) calloc(1, sizeof(struct string) + n);
> if (s == NULL)
> you_lose();
> s->length = n;
> return (s);
>}
>
>
>(If your C compiler doesn't like arrays of length zero, declare
>contents as length 1 and subtract 1 from the sizeof(struct string).)
>

Of course I've come across this sort of kluge before in maintaining old
Unix code, but to me it epitomises all the vices of which non-C
programmers accuse us. Thoroughly 'orrible. Incidentally, it's not
legal ANSI C or C++, because you are only allowed to index one position
after the end of the array. Granted, it will usually work, apart from
the pointer size problem someone else pointed out. Oh, and you may get
alignment problems. In an equivalent C++ class, you would *probably*
avoid clobbering the pointer to the vtable, because it's usually in
front of the first data member, but no guarantees; especially if you do
multiple inheritance from this class.

> This way, both the length and the data are allocated together in
> the heap. And you haven't limited the max length artificially. (I
> suspect most BASIC systems do this.)

Er, you *have* limited the length, but have deferred the decision to
run-time. Not a big win. With the normal technique (pointer to separate
character array), you can add to the string as much as you want, and
rely on the class replacing the character array with a bigger one when
necessary.


>
>|> I am afraid I am now unclear as to what you are saying can be more
>|> than 3x faster than the C++ means of handling strings in respect of
>|> allocate/deallocate. Can you sketch out your preferred solution,
and
>|> perhaps we can work out some way of benchmarking it?
>|>
>|> Scott
>
> A page fault or cache miss between accessing the length and the
> contents costs one heck of a lot! I wouldn't be surprised if, with a
> lot of allocators, the miss happens often enough to cause the 3x
> difference in speed overall.

Yes, but *is* there a 3x difference in practice? I really doubt it,
because most C++ programs are reasonably memory efficient (compared to
something like Lisp, which needs to carry a lot more run-time support)
and are quite likely to sit mainly or entirely in core. Specify some
realistic operating conditions and we can try it out.

Scott

David Chase

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:
> > It could be written as:
> >
> > class str {
> > int iCount;

> > char achData[1];
> > };
> >
> > Then you can allocate the extra memory. C++, like C, doesn't stop
> > you from doing this.

But (as the next poster pointed out) this is wrong.

m...@cs.tu-berlin.de (Markus Freericks) writes:
> The Correct way is
>
> class str {
> int iCount;


> char achData[MAXDATA]; /* whatever maximum size you want */
> };
>
> #define alloc_str_of_size(N) (str*)malloc(sizeof(str)-MAXDATA+N)

I think this is also wrong, though it is difficult to find language in the
C++ working paper that says one way or another. You'll certainly cause
pain for the tools (debuggers, interpreters) that you use. The "correct"
thing to do is to quit bending the rules and do the second indirection, as
in:

class str {
    int iCount;
    char * achData;
};

If you're going to be accessing the string in any significant way, it's
difficult to imagine that the single second indirection will cause you
any significant loss of efficiency. Besides which, if you plan to use
C++, you'd be much better off with something like (and this is still
pretty low-level, as interfaces go):

class str {
    int iCount;
    char * achData;
public:
    inline int length() {return iCount;}
    inline char * string() {return achData;}
    str(int n) : iCount(n), achData(new char[n]) { memset(achData, 0, n); }
    ~str() { delete[] achData; }
};

You should really resist the urge to write your own memory allocator,
and instead agitate for better ones from vendors. Or, find another
vendor. Or, find one in the public domain written by someone else.

speaking for myself,

David Chase

Henry Baker

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <jyv...@bmtech.demon.co.uk>, Scott Wheeler
<sco...@bmtech.demon.co.uk> wrote:

> class str {
> int iCount;


> char *pachData;
> };
>
> Of course this can mean that the length count and character array
> pointer are stored separately from the array itself, which was what I
> thought your complaint was. My reference to Pascal and BASIC was
> because they are usually implemented similarly but with the count and
> the character array stored contiguously, avoiding one level of
> indirection and only allocating one block of memory. Data
> representation is often something like
>

> class str {
> int iCount;
> char achData[256];
^^^ This is silly. The usual C solution is:
char achData[]; /* Whatever length you need */


> };
>
> with the obvious problems in terms of fixed maximum length of the
> string.

If you want to be able to increase the size of the string, then you will have
to reallocate the string and store a new pointer into the header. But
anyone using such 'growable' strings would presumably be ready to pay
this penalty.

I would contend that the vast majority of strings are not only fixed
length, but _constant_, and since there is no need to grow the string,
there is also no need to force a second indirection. The problem with
C++ is that there is no sanctioned way to store the length field contiguously
with the characters themselves without a lot of casting to get around
the type system. You basically have to go back to C mode and perhaps
even call malloc directly. You lose all of the 'advantages' of the C++
type system and ctors/dtors.

> I am afraid I am now unclear as to what you are saying can be more
> than 3x faster than the C++ means of handling strings in respect of
> allocate/deallocate. Can you sketch out your preferred solution, and
> perhaps we can work out some way of benchmarking it?

--
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html

Hans Boehm

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
hba...@netcom.com (Henry Baker) writes:

[previous discussion about C++ requiring string headers and indirection elided]

>All you've said, is that if you go along with Stroustrup/Coplien's
>programming styles, you can get the job done, sort of. However, I would
>argue that following their advice can cost you 3X - 100X in performance,
>because you drive the poor allocator/deallocator crazy.

>Forcing a program to go 2 indirections to access array/string elements because
>C++ has no way to store the length of the array/string along with the
>array/string is probably the most irritatingly obvious indication of the
>problem, but the problem is a good deal more widespread than this.

This discussion isn't very clear to me. I doubt that the extra level of
indirection for strings normally costs you much on accesses. Hopefully
most of the character accesses occur within library functions which
perform the indirection once and then look at many characters. Even
if this is not the case, a clever compiler could usually amortize the
dereference over many character accesses.

Granted, there is some additional allocation cost involved. But this at
most doubles the cost involved relative to keeping a length and the characters
in a single memory object. Aside from cases in which this just pushes you
over the paging threshold, I don't see where the 3X to 100X slowdown
should come from.

There are other places in which commonly accepted C++ methodology can easily
result in such slowdowns. Performing deep copies in order to maintain
unique references to heap objects can have horrible consequences. You can
easily change a program with linear running time into one with exponential
time and space requirements. The use of flat string representations,
as opposed to tree-based representations, can turn a program with linear
running time into one with quadratic running time. But all of the
string representations discussed here are flat (and in my opinion poor
choices for a general-purpose string package).
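
For instance, building a long string by repeated concatenation with a
flat representation recopies everything accumulated so far on each
append, so n appends do O(n^2) character copies; a minimal sketch with
invented names:

#include <stdlib.h>
#include <string.h>

char *append_flat(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *r = (char *) malloc(la + lb + 1);
    if (r == 0)
        return 0;
    memcpy(r, a, la);                /* recopies all the earlier characters */
    memcpy(r + la, b, lb + 1);       /* append b and its terminator */
    return r;                        /* caller owns r; a and b are untouched */
}

A tree (rope) representation would instead share the existing pieces
and add a small node per append, which is where the linear-versus-
quadratic difference comes from.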

Are we really talking about these other costs? They are much more
related to the absence of garbage collection than they are to the type
system.

Note that there are other cases in which introducing additional objects
and indirection levels can save space, especially with garbage collection.
See ftp://parcftp.xerox.com/pub/gc/papers/pldi93.ps.Z for details.

Should this discussion be moved to comp.lang.c++?

Hans-J. Boehm
(bo...@parc.xerox.com)
Standard disclaimer ...

Marco Antoniotti

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <3vssnd$h...@news.cs.tu-berlin.de> m...@cs.tu-berlin.de (Markus Freericks) writes:

> In article <807474...@wildcard.demon.co.uk> Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:
> > It could be written as:
> >
> > class str {
> >     int iCount;
> >     char achData[1];
> > };
> >
> > Then you can allocate the extra memory. C++, like C, doesn't stop
> > you from doing this.
>
> No, but it's wrong, wrong, wrong. Imagine a compiler that uses different
> pointer types for differently-sized objects, e.g. a PC compiler.
>
> str* foo=malloc(sizeof(str)+7000)
>
> will give you a near pointer, while
>
> str* foo=malloc(sizeof(str)+70000)
>
> will give you some kind of far pointer, and likely break your program.
>
> The Correct way is
>
> class str {
>     int iCount;
>     char achData[MAXDATA]; /* whatever maximum size you want */
> };
>
> #define alloc_str_of_size(N) (str*)malloc(sizeof(str)-MAXDATA+N)


I will try to be extremely annoying in this post. Please bear with me
and be prepared to grin....

Here comes the flame bait.

If I have to allocate a string of length N (where N is a variable),
the best way to do it is

(make-string N)

Cheers

--
Marco G. Antoniotti - Resistente Umano
-------------------------------------------------------------------------------
Robotics Lab | room: 1220 - tel. #: (212) 998 3370
Courant Institute NYU | e-mail: mar...@cs.nyu.edu
| WWW: http://found.cs.nyu.edu/marcoxa

...e` la semplicita` che e` difficile a farsi.
...it is simplicity that is difficult to make.
Bertholdt Brecht

Joe Buck

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to

Scott Wheeler <sco...@bmtech.demon.co.uk> writes:
>] Surely you are not claiming that a Pascal/BASIC-type string arrangement
>] in memory is going to run at least 3x faster than a typical C++ string?
>] This strains credibility.

doco...@sedona.intel.com (Dennis O'Connor -FT-~) writes:
>"strlen" sure would. :-)

Dennis, in the ANSI/ISO draft standard for C++ "string" refers to a
class object, not to the standard C null-terminated character array.
It's quite likely that vendors will implement this in much the same
way Pascal and Basic implement strings, in which case strlen is an
inline function that returns the count field.

--
-- Joe Buck <jb...@synopsys.com> (not speaking for Synopsys, Inc)
Anagrams for "information superhighway": Enormous hairy pig with fan
A rough whimper of insanity

Henry Baker

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <3vtcud$u...@wcap.centerline.com>, ch...@centerline.com (David
Chase) wrote:

> class str {
> int iCount;


> char * achData;
> };
>
> If you're going to be accessing the string in any significant way, it's
> difficult to imagine that the single second indirection will cause you
> any significant loss of efficiency. Besides which, if you plan to use

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <---- We were talking about 3X & greater


> C++, you'd be much better off with something like (and this is still
> pretty low-level, as interfaces go):
>

> class str {
> int iCount;


> char * achData;
> public:
> inline int length() {return iCount;}
> inline char * string() {return achData;}
> str(int n) : iCount(n), achData(new char[n]) { memset(achData, 0, n); }
> ~str() { delete[] achData; }
> };

David:

You may have missed the start of this thread. The major complaint is
not in simply _accessing_ the string, although there is definitely a
slowdown associated with that, but with _allocating/deallocating_ the
string.

The Stroustrup/Coplien school advocates using 2 mallocated objects in C++
when one will do in C, and this thread already described many of the
horrors of malloc inefficiencies.

So the point of the posting was that the Stroustrup/Coplien style
_guarantees_ bad performance, unless you are using class-specific
allocation, in which case you still get bad performance (although possibly
not quite as egregious), but now difficult-to-maintain code as well.

> You should really resist the urge to write your own memory allocator,
> and instead agitate for better ones from vendors. Or, find another
> vendor. Or, find one in the public domain written by someone else.

This is the point of the thread, but I was simply pointing out that
Stroustrup/Coplien acknowledge that their school has bad allocator
performance, and recommend that users write their own mallocator.

So you have now gone on record as disagreeing with Stroustrup & Coplien.

Cyber Surfer

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <3vtcud$u...@wcap.centerline.com>
ch...@centerline.com "David Chase" writes:

> > > Then you can allocate the extra memory. C++, like C, doesn't stop
> > > you from doing this.
>

> But (as the next poster pointed out) this is wrong.

I've still found it useful. Curiously, one example was in an allocator
for a Lisp heap. I knew _exactly_ how it was going to be used, and if
it was safe enough.

> I think this is also wrong, though it is difficult to find language in the
> C++ working paper that says one way or another. You'll certainly cause
> pain for the tools (debuggers, interpreters) that you use. The "correct"

I've never used a C/C++ interpreter, so I'm not worried about that,
and I've not used a debugger for a few years now. Perhaps when I
upgrade my current C/C++ system, I'll try the debugger and test this.

> You should really resist the urge to write your own memory allocator,
> and instead agitate for better ones from vendors. Or, find another
> vendor. Or, find one in the public domain written by someone else.

I needed one with a GC, and as I couldn't get a commercial one, I wrote
my own. If you know of a C/C++ vendor that includes an alternative to
the malloc/free, new/delete style of allocation, please let me know.
I'll certainly be curious, and if the system includes support for a
platform that I'm developing for, then I may be able to use it.

Thanks.

Cyber Surfer

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
In article <hbaker-0408...@192.0.2.1>
hba...@netcom.com "Henry Baker" writes:

> I would contend that the vast majority of strings are not only fixed
> length, but _constant_, and since there is no need to grow the string,
> there is also no need to force a second indirection. The problem with

Agreed. That's certainly the case in my C/C++ code. That's the nature
of the apps I write, rather than the way I code.

> C++ is that there is no sanctioned way to store the length field contiguously
> with the characters themselves without a lot of casting to get around
> the type system. You basically have to go back to C mode and perhaps
> even call malloc directly. You lose all of the 'advantages' of the C++
> type system and ctors/dtors.

Yep. I hate that, but I can live with it. I have to live without other
things, things that I'd _much_ rather have. So it goes. I'm not alone.

Robin Popplestone

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
> In article <3vnghm$s...@info.epfl.ch>,
> Stefan Monnier <stefan....@epfl.ch> wrote:
>In article <hbaker-3107...@192.0.2.1>,
>Henry Baker <hba...@netcom.com> wrote:
>] IMHO the major gain from compaction is to defragment the address space,
>] but as Paul's recent work has shown, one may not have to do this very
>] often.

And Paul Wilson writes in <3vnvr1$9...@jive.cs.utexas.edu>

> Right. For "most" programs (that have been studied in enough detail, that
> is), it looks like you never have to do it at all. Nobody's studied
> really long-running programs, though, and it's not clear whether
> fragmentation will accumulate over time, or stabilize at an acceptable
> level.

The longest running, biggest mix, garbage collected systems are probably
those in which the operating system itself is subject to the same garbage
collector as user-programs. Experience with the Multipop time-sharing
system over 2 decades ago showed that compaction was necessary in the
circumstances pertaining then, at least in a small machine (384KB). Both
data and code-blocks were garbage collected, and the job-mix included
robotic programs that had largish digitised images as well as classic
list processing such as the (original) Boyer-Moore theorem prover.
The garbage collector operated in-situ (no virtual memory!), but had a
compaction pass which was conditionally invoked. Without compaction, the
system would rapidly become unusable and have to be rebooted. With
compaction it would run for a (small) number of days.

How this experience would scale to a modern environment is not obvious.

Robin Popplestone.

Patrick Herring

unread,
Aug 4, 1995, 3:00:00 AM8/4/95
to
p...@fat-controller.cs.bham.ac.uk writes in article <PTF.95Au...@fat-controller.cs.bham.ac.uk>:

>
> To my mind the two examples that Paul gave and the one above
> illustrate that we should not place too much reliance on a single
> general purpose allocator - especially if we can use information about
> the likely behaviour of the objects in it. In the first case separate
> allocators for large and small objects would reduce or eliminate the
> fragmentation. In my example (allocation of a collection of large
> numbers of small objects which have a relatively short life)
> performance was vastly improved by writing a custom allocator (Ok, Ok,
> yes I *know* ...) which managed a heap in a "mmaped" region - when the
> mesh was finished with it was returned to the OS (thus eliminating
> both the redundant page out and page in).
>
> With current OSes (ok I'm thinking mostly of Unix) we probably need
> three allocators
>
> 1) General purpose for small-medium size short-long lived objects
> 2) A large object allocator
> 3) An allocator for large collections of short-lived objects.
>
> Separating large and small objects should reduce fragmentation, being
> able to signal to the memory management system that we're no longer
> interested in the data a section of memory contains helps reduce
> unwanted write-backs and re-loads (both for paged systems and at a
> finer level for caches).

IDMS-DC (a TPM for mainframes) seems to get considerable mileage from
distinguishing short- and long-term memory. Short-term is allocated from the
start and long-term from the end of the storage pool (effectively the global
public heap). Short-term is for program data (ie a few micro-seconds - in
IDMS-DC programs do their stuff then tell the OS to wait for the user to do
something by setting up an event control block for the terminal, then exit) and
long-term for saving data for reuse when the user presses enter (eg db locks,
previous screen images). Of course this needs a preallocated fixed size storage
area - IDMS-DC has a screen of diagnostics eg high water marks so that the
DBAs can keep things happy.

Just another perspective.

Yours, Patrick
_____________________________________________________________________________

Patrick Herring, p...@anweald.exnet.co.uk
I tend to eat my UUCP feed once a day so replies can take two days

Paul Wilson

unread,
Aug 5, 1995, 3:00:00 AM8/5/95
to
In article <3vtvi5$g...@kernighan.cs.umass.edu>,

Robin Popplestone <p...@cs.umass.edu> wrote:
>Paul Wilson writes in <3vnvr1$9...@jive.cs.utexas.edu>
>
>> Right. For "most" programs (that have been studied in enough detail, that
>> is), it looks like you never have to do it at all. Nobody's studied
>> really long-running programs, though, and it's not clear whether
>> fragmentation will accumulate over time, or stabilize at an acceptable
>> level.
>
>The longest running, biggest mix, garbage collected systems are probably
>those in which the operating system itself is subject to the same garbage
>collector as user-programs. Experience with the Multipop time-sharing
>system over 2 decades ago showed that compaction was necessary in the
>circumstances pertaining then, at least in a small machine (384KB). Both
>data and code-blocks were garbage collected, and the job-mix included
>robotic programs that had largish digitised images as well as classic
>list processing such as the (original) Boyer-Moore theorem prover.
>The garbage collector operated in-situ (no virtual memory!), but had a
>compaction pass which was conditionally invoked. Without compaction, the
>system would rapidly become unusable and have to be rebooted. With
>compaction it would run for a (small) number of days.

I find this pretty interesting. I take it this was a unified heap,
like in a Lisp machine system, where you don't have the usual process
memory boundaries, but unlike a Lisp machine in that you seldom do
compaction.

Unified heaps for multiple processes are interesting, because nobody
knows whether the intermixing of objects created and used for different
concurrent processes tends to increase fragmentation, or reduce it,
or what. (And I think that's likely to be very allocator dependent.)

The fact that this system could run for a day or more without accumulating
pathological fragmentation is a good sign, up to a point.

Is there a more detailed description of the allocator strategy? (E.g.,
did the allocator attempt to separate objects created by different
processes, or just mix them together in allocation order, or what?)

Anecdotes about what was tried and *didn't* work well would be as
interesting as descriptions of what was eventually settled on.

>How this experience would scale to a modern environment is not obvious.

Too true, but it's food for thought.

>Robin Popplestone.

Kelvin Nilsen

unread,
Aug 5, 1995, 3:00:00 AM8/5/95
to
p...@cs.umass.edu ( Robin Popplestone ) writes:

>> In article <3vnghm$s...@info.epfl.ch>,
>> Stefan Monnier <stefan....@epfl.ch> wrote:
>>In article <hbaker-3107...@192.0.2.1>,
>>Henry Baker <hba...@netcom.com> wrote:
>>] IMHO the major gain from compaction is to defragment the address space,
>>] but as Paul's recent work has shown, one may not have to do this very
>>] often.

>And Paul Wilson writes in <3vnvr1$9...@jive.cs.utexas.edu>

>> Right. For "most" programs (that have been studied in enough detail, that
>> is), it looks like you never have to do it at all. Nobody's studied
>> really long-running programs, though, and it's not clear whether
>> fragmentation will accumulate over time, or stabilize at an acceptable
>> level.

>The longest running, biggest mix, garbage collected systems are probably


>those in which the operating system itself is subject to the same garbage
>collector as user-programs. Experience with the Multipop time-sharing
>system over 2 decades ago showed that compaction was necessary in the
>circumstances pertaining then, at least in a small machine (384KB).

While many individual programs tend to exhibit the behavior that Paul
describes, I would certainly not want to gamble the future of automatic
garbage collection on overgeneralization of this empirical observation.
There are a number of reasons why it would be wise to hedge your bets:

One reason that defragmentation is often not necessary is because
many of the programs that have been studied empirically allocate
a large proportion of their objects from a small number of
predetermined size classes. This is not necessarily representative
of the "typical" application of the future:

a) The tendency to allocate objects in predetermined sizes is
in part a consequence of certain programming language
implementation designs (e.g. Lisp uses dotted pairs to
represent "everything") and partly the influence of static
programming language tunnel vision. One of the indirect benefits
of garbage collection is that it frees programmers from these
restrictive mindsets. Cdr-coded Lisp lists are likely to
vary greatly in length. Does anyone have statistics? In
Icon, efficiency of string operations was a major design
objective. Since the implementation of strings is so efficient,
Icon programmers feel very comfortable creating strings ranging
in length from several bytes to tens of kilobytes. Some
real-world programs even create strings that exceed a hundred
kilobytes. Icon's implementations of lists and dynamic hash tables
also benefit greatly from the ability (made possible through
automatic compacting garbage collection) to size internal objects
in whatever sizes are most appropriate for run-time efficiency.

b) Consider the implementation of digital TV studio software. The
"programmer" is likely to need quick access to a large number of
digital audio and video clips. The lengths of these clips will
be highly unpredictable. I can envision the day when these
audio-visual clips would be as easily manipulated as Icon strings
are manipulated today.

c) Consider the implementation of a PDA system. Suppose the system
has no disk. The PDA is required to retain its user's appointment
calendar (with variable amounts of information representing each
day's scheduled and logged activities), contact lists (with variable
amounts of information associated with each contact), scratch pads
(including spreadsheets, ASCII notes, untranslated pen scribbles,
vector graphics, email memos sent and received, faxes sent
and received), voice notes (annotations to the scratch-pad
information, personal dictations, voice mail forwarded from
the central office), executable code segments (for each of the
supported applications), etc. Clearly the sizes of allocated
segments are highly unpredictable. Now, add another complicating
factor: compress all objects that are larger than a particular
threshold size and have not been accessed within the last N
minutes. The compression factor itself is unpredictable.

In summary, the notion that "defragmentation of dynamic memory is not
necessary" is a self-fulfilling prophecy. In the absence of compaction
and/or in environments that provide seemingly unbounded amounts of
virtual memory, programmers are likely to develop software in ways that
do not depend on defragmentation, but there are real, non-trivial costs
associated with this style of development.


--

Kelvin Nilsen/Dept. of Computer Science/Iowa State University/Ames, IA 50011
(515) 294-2259 kel...@cs.iastate.edu uunet!cs.iastate.edu!kelvin


Paul Wilson

unread,
Aug 5, 1995, 3:00:00 AM8/5/95
to
In article <807549...@anweald.exnet.co.uk>,

Patrick Herring <p...@anweald.exnet.co.uk> wrote:
>p...@fat-controller.cs.bham.ac.uk writes in article <PTF.95Au...@fat-controller.cs.bham.ac.uk>:
>>
>> To my mind the two examples that Paul gave and the one above
>> illustrate that we should not place too much reliance on a single
>> general purpose allocator - especially if we can use information about
>> the likely behaviour of the objects in it. In the first case separate
>> allocators for large and small objects would reduce or eliminate the
>> fragmentation. In my example (allocation of a collection of large
>> numbers of small objects which have a relatively short life)
>> performance was vastly improved by writing a custom allocator (Ok, Ok,
>> yes I *know* ...) which managed a heap in a "mmaped" region - when the
>> mesh was finished with it was returned to the OS (thus eliminating
>> both the redundant page out and page in).
>>
>> With current OSes (ok I'm thinking mostly of Unix) we probably need
>> three allocators
>>
>> 1) General purpose for small-medium size short-long lived objects
>> 2) A large object allocator
>> 3) An allocator for large collections of short-lived objects.

I partly agree, but I think that, by and large, we'll be able to get by
pretty well in most cases with a single allocator that combines multiple
strategies, which come into play in different situations.

For example, a best fit allocator may turn out (or may not---nobody knows
yet) to effectively combine some of these strategies already, which
may account for its generally good performance. (See my allocator survey
for some somewhat informed speculation about this.) Small changes may
enhance these tendencies without making them defeat each other.

>> Separating large and small objects should reduce fragmentation, being
>> able to signal to the memory management system that we're no longer
>> interested in the data a section of memory contains helps reduce
>> unwanted write-backs and re-loads (both for paged systems and at a
>> finer level for caches).

Right. I think that we may be able to get most of this effect without
actually depending on funky memory hierarchy features. An allocator
that exhibits generally good clustering and which tends to reuse
recently-freed memory soonest will tend to page out pages that either
have live data that hasn't been touched for a while, or which are actually
empty (contain no live data anymore) but aren't being reused. In the
short term, at least, the costs of these pageouts may be small, although
in the longer term, the increasing relative cost of pageouts (compared
to CPU speed, etc.) may make it worth telling the OS that whole pages
have been discarded. It depends on locality issues that aren't understood.
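
For concreteness, "telling the OS" can be as cheap as a madvise() call on
systems that have one; a minimal sketch (the helper name is invented, and
MADV_DONTNEED semantics differ a little between Unixes):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round [p, p+len) inward to whole pages and tell the VM system it may
   simply drop them: an empty page then costs neither a page-out nor a
   later page-in of stale garbage. */
void discard_pages(void* p, size_t len)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    uintptr_t lo = ((uintptr_t) p + page - 1) & ~(uintptr_t) (page - 1);
    uintptr_t hi = ((uintptr_t) p + len) & ~(uintptr_t) (page - 1);
    if (hi > lo)
        madvise((void*) lo, (size_t) (hi - lo), MADV_DONTNEED);
}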

>IDMS-DC (a TPM for mainframes) seems to get considerable mileage from
>distinguishing short- and long-term memory. Short-term is allocated from the
>start and long-term from the end of the storage pool (effectively the global
>public heap). Short-term is for program data (ie a few micro-seconds - in
>IDMS-DC programs do their stuff then tell the OS to wait for the user to do
>something by setting up an event control block for the terminal, then exit) and
>long-term for saving data for reuse when the user presses enter (eg db locks,
>previous screen images). Of course this needs a preallocated fixed size storage
>area - IDMS-DC has a screen of diagnostics eg high water marks so that the
>DBAs can keep things happy.

Very interesting. Is there any more detailed description anywhere of how
this system was developed, and what experiments (formal or informal) were
done? At this point, most of the older allocator research is not really
very useful, because it studied synthetic and supposedly "typical"
programs, but I think that was a big mistake. At this point, anecdotes
are actually much more useful as a hint as to what kinds of program/allocator
interactions are likely to occur in practice.

>Just another perspective.

And a welcome one.

>Yours, Patrick

Paul

Paul Flinders

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In article <3vr85r$7...@fido.asd.sgi.com> mik...@engr.sgi.com (Mike McDonald) writes:

> A page fault or cache miss between accessing the length and the
> contents cost one heck of a lot! I wouldn't be surprised that with
> a lot of allocators, the miss happens often enough to cause the 3x
> difference in speed overall.

Ok, that's possible - but if you're doing much string manipulation
then you probably have more than one string - in which case arrange
the storage so that the class objects which hold the lengths are close
together in memory - then they're likely to stay in core.
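
One way to arrange that, sketched here with invented names: give the small
"header" objects their own pool, so the lengths of many strings sit in a
few pages (and cache lines) instead of being scattered through the heap:

#include <stddef.h>
#include <stdlib.h>

struct StrHeader {
    size_t len;
    char*  chars;
};

static const size_t POOL_SIZE = 4096;
static StrHeader    pool[POOL_SIZE];     // all headers live contiguously
static StrHeader*   free_list = 0;
static size_t       next_unused = 0;

StrHeader* alloc_header()
{
    if (free_list) {                         // reuse a freed slot first
        StrHeader* h = free_list;
        free_list = (StrHeader*) h->chars;   // chars doubles as next-link
        return h;
    }
    if (next_unused < POOL_SIZE)
        return &pool[next_unused++];
    return (StrHeader*) malloc(sizeof(StrHeader));   // overflow: fall back
}

void free_header(StrHeader* h)
{
    if (h >= pool && h < pool + POOL_SIZE) {
        h->chars = (char*) free_list;        // thread onto the free list
        free_list = h;
    } else {
        free(h);
    }
}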

TTFN

Paul.

Hans Boehm

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
p...@cs.umass.edu ( Robin Popplestone ) writes:

>The longest running, biggest mix, garbage collected systems are probably
>those in which the operating system itself is subject to the same garbage
>collector as user-programs. Experience with the Multipop time-sharing
>system over 2 decades ago showed that compaction was necessary in the
>circumstances pertaining then, at least in a small machine (384KB). Both
>data and code-blocks were garbage collected, and the job-mix included
>robotic programs that had largish digitised images as well as classic
>list processing such as the (original) Boyer-Moore theorem prover.
>The garbage collector operated in-situ (no virtual memory!), but had a
>compaction pass which was conditionally invoked. Without compaction, the
>system would rapidly become unusable and have to be rebooted. With
>compaction it would run for a (small) number of days.

I think this all depends on so many factors (e.g. allocation algorithms,
presence of VM, scarcity of physical memory) that it is very hard to
generalize. My Cedar environment stays up for at least weeks
with a purely noncompacting collector. Though it runs as a
group of SunOS processes, it also has a thread switcher and many client
threads (mailers, editors, etc.) running in one address space. I don't
notice any performance deterioration. But the heap rarely has more than
25 MB or so live data, and recently I've always had at least 64MB of
physical memory.

>How this experience would scale to a modern environment is not obvious.

Agreed.

I suspect that the right kind of compaction is occasionally useful. But
you can go a long way without it.

Scott Wheeler

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In Article <807568...@wildcard.demon.co.uk> Cyber Surfer writes:
>I prefer to just let a compiler handle all this, but as far as I'm
>aware, there's still no "standard" C++ string class. Please correct
>me if I'm wrong, as it would be news to me.

See PJ Plauger "The draft ANSI C++ library". This part of the draft can
be expected not to change significantly. It's implemented in the
Borland compiler.

Scott

Scott Wheeler

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In Article <hbaker-0408...@192.0.2.1> Henry Baker writes:
>The Stroustrup/Coplien school advocates using 2 mallocated objects in
>C++ when one will do in C, and this thread already described many of
>the horrors of malloc inefficiencies.

Only one allocation of heap memory (not *malloc*, please - that's C)
unless you are using "new" to generate the string, because the top
level of the object will go on the stack. Since temporaries and
automatic variables go on the stack, this makes a substantial
difference.

>So the point of the posting was that the Stroustrup/Coplien style
>_guarantees_ bad performance,

I've yet to see any hard data posted. The 3x quoted repeatedly here has
not yet been shown to be more than a guess.

Scott

Henry Baker

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In article <jyw...@bmtech.demon.co.uk>, Scott Wheeler
<sco...@bmtech.demon.co.uk> wrote:

> In Article <hbaker-0408...@192.0.2.1> Henry Baker writes:
> >The Stroustrup/Coplien school advocates using 2 mallocated objects in
> >C++ when one will do in C, and this thread already described many of
> >the horrors of malloc inefficiencies.
>
> Only one allocation of heap memory (not *malloc*, please - that's C)
> unless you are using "new" to generate the string, because the top
> level of the object will go on the stack. Since temporaries and
> automatic variables go on the stack, this makes a substantial
> difference.

Excuse me?? You should read the Stroustrup/Coplien books again. They
suggest newing/mallocating everything, & putting ref counts in the string
'headers'. Your scheme could conceivably work, if one were willing to
copy around both the ptr to the string chars _and_ the length field, but
such an implementation would make a string 'handle' bigger than a usual
C++ pointer, and thereby cause a number of other problems with the C++
type system. Furthermore, one would _not_ want to copy any ref counts
with such a scheme. In any case, I haven't seen this particular variant
recommended in any of the usual books.

> >So the point of the posting was that the Stroustrup/Coplien style
> >_guarantees_ bad performance,
>
> I've yet to see any hard data posted. The 3x quoted repeatedly here has
> not yet been shown to be more than a guess.

The double allocator hit would be 2X by itself, all other things being
equal. If the potentially greater fragmentation of the free list caused
a lengthening of freelist searches, then this would show up in further reduced
allocator and deallocator performance.

Accessing the string itself is at least 2X slower, due to the double
indirection.

As Paul Wilson pointed out, the 'rounding up' of each object to at least
16 bytes could cost severely in memory hierarchy costs (caching, VM, etc.),
and such costs would be most evident in the case where most strings were
short (i.e., <16 bytes, the most usual case).

Yes, my 3x guess is just a guess, but a well-educated guess based on
lots & lots of experience.
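
A sketch of the two layouts being argued about (illustrative only, not
anyone's actual class):

// (a) handle -> separately allocated body -> separately allocated
//     characters: two heap blocks per string, and two loads before the
//     first character is reachable.
struct SRep   { int refs; int len; char* chars; };  // block 1; chars is block 2
struct Handle { SRep* rep; };                       // the string object itself

// (b) length and characters in one contiguous block: one allocation,
//     one indirection.  (The over-allocation trick is debated further
//     on in this thread.)
struct FlatStr { int len; char data[1]; };          // allocated with room for len chars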

Cyber Surfer

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In article <jyw...@bmtech.demon.co.uk>
sco...@bmtech.demon.co.uk "Scott Wheeler" writes:

> See PJ Plauger "The draft ANSI C++ library". This part of the draft can
> be expected not to change significantly. It's implemented in the
> Borland compiler.

Is that a book? If so, do you have the ISBN?

Any idea when MS or Symantec will implement it? Sadly, I can't choose
which compiler I use, at the moment, and Borland isn't on the list.

Thanks.

David Chase

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
hba...@netcom.com (Henry Baker) writes:

> This is the point of the thread, but I was simply pointing out that
> Stroustrup/Coplien acknowledge that their school has bad allocator
> performance, and recommend that users write their own mallocator.
>
> So you have now gone on record as disagreeing with Stroustrup & Coplien.

These things happen. I imagine I'll survive the experience.
However --

Seeing as how there's been a perfectly adequate 32-bit Unix memory
allocator available from

ftp.parc.xerox.com:pub/gc/gc.tar.Z

since about 1988, it seems like it would be unwise to write your
own. (This memory allocator is also a garbage collector, but it is
trival to turn off the garbage collector and use it instead as a
fast, low-overhead malloc-style memory allocator. At the time, it
was faster than ANY vendor-supplied malloc I could find, and it has
a very low overhead as well). The fact that this is possible was
mentioned in Boehm and Weiser's Software Practice and Experience
paper (late 1988) where they describe use of garbage collection
techniques to implement a leak detector.

[ This is also available (I think) with real live support as "Great
Circle" from Geodesic Systems (http://www.geodesic.com, or
sa...@geodesic.com). ]
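
A hedged sketch of "turning the collector off and using it as a malloc",
assuming the GC_malloc/GC_free entry points and the GC_dont_gc flag found
in current distributions of the collector (the exact names may have
differed in the early versions):

#include <stddef.h>
#include "gc.h"                  // Boehm/Demers/Weiser collector

void* xmalloc(size_t n)
{
    GC_dont_gc = 1;              // never collect automatically
    return GC_malloc(n);
}

void xfree(void* p)
{
    GC_free(p);                  // explicit deallocation, as with free()
}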

Furthermore, even if you have to write a memory allocator, very
reasonable advice on how-to was given as long ago as 1977, and
allocators designed according to that advice are still among the
best available (Norman Nielsen, "Dynamic Memory Allocation in
Computer Simulation", CACM 20:11, November, 1977). This paper was
no big secret -- I remember being pointed to it in a graduate-level
OS principles class when I was an undergraduate. It was in that
same class that the badness of roving-fit (and why) was first
described to me as well, so, again, this was not a big secret either.

Cyber Surfer

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In article <405i5n$e...@news.parc.xerox.com>
bo...@parc.xerox.com "Hans Boehm" writes:

> See ftp://parcftp.xerox.com/pub/gc/gc.html for pointers to a couple
> of commercial garbage collectors for C/C++ (as well as a description
> of a freely available one). I know of no commercial C/C++
> implementations that include GC support. However, I believe NAG
> Fortran 90 does.

I'll make a note of that, in case I ever find a use for Fortran 90,
or any Fortran. Thanks for the C/C++ GC pointers.

I FTP'd your GC last year, but I didn't have time to rewrite the
makefile so I could build it. I'd still like to play with it, someday.

Hans Boehm

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:

>I needed one with a GC, and as I couldn't get a commercial one, I wrote
>my own. If you know of a C/C++ vendor that includes an alternative to
>the malloc/free, new/delete style of allocation, please let me know.
>I'll certainly be curious, and if the system includes support for a
>platform that I'm developing for, then I may be able to use it.

See ftp://parcftp.xerox.com/pub/gc/gc.html for pointers to a couple
of commercial garbage collectors for C/C++ (as well as a description
of a freely available one). I know of no commercial C/C++
implementations that include GC support. However, I believe NAG
Fortran 90 does.

Hans-J. Boehm
(bo...@parc.xerox.com)
Standard disclaimer ...

Hans Boehm

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
hba...@netcom.com (Henry Baker) writes:

>I would contend that the vast majority of strings are not only fixed
>length, but _constant_, and since there is no need to grow the string,
>there is also no need to force a second indirection. The problem with
>C++ is that there is no sanctioned way to store the length field contiguously
>with the characters themselves without a lot of casting to get around
>the type system. You basically have to go back to C mode and perhaps
>even call malloc directly. You lose all of the 'advantages' of the C++
>type system and ctors/dtors.

It still seems to me that we arguing essentially about the best way to
implement the wrong interface.

We seem to agree that most strings are (or should be) constant. But
presumably they are not explicit constants in the program source. So
they are computed somehow and then not modified. Typically they are
computed as the concatenation of other strings, as the substring of
another string, or by some combination of these, starting from program
constants, integer-to-string conversions, etc.

This argues that we should be concerned with:

1. Immutable strings.
2. Concatenation time and space requirements.
3. Substring time and space requirements.
4. String access time requirements.

String access time is linear for all reasonable implementations.

Concatenation takes linear time for all the proposals discussed here.
That means it takes quadratic time to build up a string by repeated
concatenation. It can take basically constant time per operation if we use
a more structured string representation in which a string is a tree of
concatenations. In this case, the concatenation operation doesn't
have to copy its arguments; the result shares them. There is a potentially
exponential difference between the space required by conventional
flat strings, and that required by trees.

Similarly, the substring operation can take either constant or logarithmic
time if we allow it to share storage with the original string. This requires
at least one level of indirection. Using the standard representation it
requires either linear time or a special substring class with one level of
indirection.

None of this is new. Cedar has used such strings ("ropes") for 10 years.
A C based implementation ("cords") is included with our garbage collector.

So we're ignoring a potentially exponential complexity difference, and
we're concentrating on the factor of 2 in the asymptotically inferior
solution? Why?
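
For readers who haven't seen them, a minimal sketch of tree-structured
strings in this spirit (an illustration only, not the Cedar rope or cord
API):

#include <stddef.h>

struct Rope {                     // immutable once built
    const char* flat;             // non-null: a leaf holding 'len' chars
    const Rope* left;             // else: a concatenation node
    const Rope* right;
    size_t      len;
};

// Concatenation in constant time: share both arguments, copy nothing.
const Rope* concat(const Rope* a, const Rope* b)
{
    Rope* r = new Rope;
    r->flat = 0;
    r->left = a;
    r->right = b;
    r->len = a->len + b->len;
    return r;
}

// Indexing walks the tree: O(depth) instead of O(1) -- the small
// constant factor traded for the asymptotic gains above.
char rope_index(const Rope* r, size_t i)
{
    while (!r->flat) {
        if (i < r->left->len) {
            r = r->left;
        } else {
            i -= r->left->len;
            r = r->right;
        }
    }
    return r->flat[i];
}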

Dennis O'Connor -FT-~

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to

jb...@synopsys.com (Joe Buck) writes:

] Scott Wheeler <sco...@bmtech.demon.co.uk> writes:
] >] Surely you are not claiming that a Pascal/BASIC-type string arrangement
] >] in memory is going to run at least 3x faster than a typical C++ string?
] >] This strains credibility.
]
] doco...@sedona.intel.com (Dennis O'Connor -FT-~) writes:
] >"strlen" sure would. :-)
]
] Dennis, in the ANSI/ISO draft standard for C++ "string" refers to a
] class object, not to the standard C null-terminated character array.

See what happens when you mix overloaded terms with in-progress
entry into the discussion? And do you mean "strings" or "Strings"?

To be honest, I haven't paid much attention to the various
additional class libraries that are floating around. I pretty
much stick to the stuff in the ARM and our home-grown purpose-built
classes. Libraries like NIHCL never seemed to have _exactly_ the
classes I needed ...
--
Dennis O'Connor doco...@sedona.intel.com
i960(R) Architecture and Core Design Not an Intel spokesman.
TIP#518 Fear is the enemy.

Cliff Click

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
bo...@parc.xerox.com (Hans Boehm) writes:

> So we're ignoring a potentially exponential complexity difference, and
> we're concentrating on the factor of 2 in the asymptotically inferior
> solution? Why?

4. We didn't know ["ropes", "cords"] existed.
3. Strings aren't a bottleneck, so we don't care.
2. We are worried that the constant factor of the asymptotically better
solution will be too large, 'cause we'll never hit the bad case in
our applications.

and the number 1 reason is:

1. We learned C strings first, so that's what we'll always use.

( Triple smileys all around, except for the #1 reason :-P )


Cliff

--
Cliff Click Compiler Research Scientist
Cambridge Research Office, Hewlett-Packard Laboratories
One Main Street, 10th Floor, Cambridge, MA 02142
(617) 225-4915 Work (617) 225-4930 Fax
cli...@hpl.hp.com http://bellona.cs.rice.edu/MSCP/cliff.html

Henry Baker

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
In article <405k8h$e...@news.parc.xerox.com>, bo...@parc.xerox.com (Hans
Boehm) wrote:

> hba...@netcom.com (Henry Baker) writes:
> So we're ignoring a potentially exponential complexity difference, and
> we're concentrating on the factor of 2 in the asymptotically inferior
> solution? Why?

Because for most small strings, your O() arguments are completely washed
out by constant factors. Most of the concatenation I have seen done
in C programs is done by either sprintf or strcat, and these strings are
typically quite small. Therefore, the allocation/deallocation overhead
swamps the differences in complexity that you mention.

There are programs that waste a lot of time doing string manipulation. I
would imagine that most of the Perl-like stuff is quite non-optimal. If
people aren't using better representations for these kinds of strings,
it's probably because implementing the algorithm as a string-processing
algorithm is not such a hot idea in the first place. In other words,
someone would be better off changing the algorithm completely, rather
than worrying about some non-optimal string representation.

A good example is symbolic algebra. You can get along for a while
with a representation that does no sharing at all. But the hideous
complexity of most symbolic algebra algorithms forces most systems to
utilize better representations. I'm not sure that a 'string' representation
would ever be competitive with a more sophisticated representation for
symbolic algebra.

Robin Popplestone

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
Cc:wil...@cs.utexas.edu

Paul Wilson writes:

> The fact that this system could run for a day or more without
> accumulating pathological fragmentation is a good sign, up to a point.

I think you have misunderstood me - my earlier message stated:

> Without compaction, the system would rapidly become unusable and have to
> be rebooted. With compaction it would run for a (small) number of days.

That is, *terminal* fragmentation set in within about an hour of usage if the
compaction algorithm was not in use. With the compaction algorithm, it
would in principle run indefinitely (given that users who exceeded their
allocated resources could be killed). However there was a bug somewhere in
the system (we suspected the hardware) which resulted in a mean time
between failure of something over 1 day. Even programs like the Boyer-Moore
prover, whose data-usage was mostly list-cells, would cause fragmentation as
they were incrementally re-compiled during debugging because the code-blocks
were of disparate sizes.

It is possible that Multipop will be revived in the Royal Scottish
Museum, which has acquired the necessary hardware to do so. The (paper)
tapes of at least one version of the system also exist, so
the performance can possibly be verified.

Paul Wilson also writes:

> I find this pretty interesting. I take it this was a unified heap,
> like in a Lisp machine system, where you don't have the usual process
> memory boundaries, but unlike a Lisp machine in that you seldom do
> compaction very often.

...

From (my wetware) memory (this could in principle be verified from
extant listings):

The memory belonging to each process was NOT segregated. Security
depended solely on the fact that the one compiler (for POP-2) generated
well behaved code of a quality roughly comparable to a simple LISP
compiler. Store usage by a user was accounted for by the garbage collector,
which, starting from process records, added the size of each marked object
to a variable. A user exceeding his/her allocation could be killed by the
machine operator (!) setting a sense-key on the console; this was
generally done if the machine seemed to be g.c. thrashing.

> Unified heaps for multiple processes are interesting, because nobody
> knows whether the intermixing of objects created and used for different
> concurrent processes tends to increase fragmentation, or reduce it,
> or what. (And I think that's likely to be very allocator dependent.)

The earliest design for Multipop68 called for a "zoo of objects in
cages", i.e. each class of object would be held in a separate area of
store, so that type could be determined by location. Thus user-specific
data-types would have been separated, though common types (e.g. lists,
strings) would have been put in the same cage. This "zoo" was abandoned in
favour of the "key cell" system (q.v.), due to Mike Foster, partly because
of the greater complication of the zoo, and partly because we suspected
that any savings arising from the encoding of type in the address would be
lost in the existence of unusable space in partly-full cages.

A more conventional time sharing system (Multipop 70) was
developed for the target machine. It was universally disliked
for bad performance. However that arose from the limitations of the discs,
combined with a small main memory, so has no relevance to the
fragmentation issue.

> Is there a more detailed description of the allocator strategy? (E.g.,
> did the allocator attempt to separte objects created by different
> processes, or just mix them together in allocation order, or what?)

The allocator maintained separate free lists for small objects (up to about
five words in length I recall). Longer objects were allocated by searching
a free list until one of the right length or larger was found. These free
lists were global to the system. If no object was found, a garbage
collection was started. (This meant of course that a user who turned over
store rapidly got an unfair share of machine resources).
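
A reconstruction, from the description above only, of roughly what that
allocation strategy looks like in outline (all names are invented):

#include <stddef.h>

struct Block { size_t words; Block* next; };

static const size_t SMALL_MAX = 5;        // "up to about five words"
static Block* small_free[SMALL_MAX + 1];  // one exact-size list per size
static Block* large_free;                 // searched first-fit

Block* allocate(size_t words)
{
    if (words <= SMALL_MAX && small_free[words]) {
        Block* b = small_free[words];     // pop an exact-size block
        small_free[words] = b->next;
        return b;
    }
    for (Block** p = &large_free; *p; p = &(*p)->next) {
        if ((*p)->words >= words) {       // first block big enough
            Block* b = *p;
            *p = b->next;
            return b;                     // (splitting and refilling the
        }                                 //  small lists omitted)
    }
    return 0;                             // caller triggers a collection
}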

I recall that we used a dedicated area of store to hold the
old-address->new address map for the purposes of compaction - thus more
than one compaction might be needed to recover severely fragmented space.

Each object in memory had one pointer to a "key cell" which contained
data-class-common information, including two methods needed by the garbage
collector, a mark-and-trace method and a relocate method (we did not call
them methods of course...). A data-class whose members varied in length
(e.g. strings) had an additional field in each instance which specified the
length. Users could define their own data-classes.
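
In modern notation the key-cell arrangement amounts to something like the
following (a sketch only; the original was not, of course, written in C++):

struct KeyCell {
    void (*mark_and_trace)(void* obj);   // GC method 1
    void (*relocate)(void* obj);         // GC method 2, used by compaction
    // ... other data-class-common information ...
};

struct ObjectHeader {
    KeyCell* key;    // every object points at its data-class's key cell
    // variable-length classes (e.g. strings) add a length field here
};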

A process record contained (a) 2 stacks for each user (b) a dictionary
mapping identifiers to store locations (all variables were in the LISP
sense special, with closures done by what was later to be called "lambda
lifting") (c) a copy of the special variables of the compiler. A context
switch involved saving the special variables (including the PC) of the user
being swapped out, and restoring those of the user being swapped in.

Thus each user had one process. Later (around 1970) continuations were
added to POP-2, but were never used as the basic process mechanism of
Multipop, being decidedly inefficient given the non-lexical nature of
variables.

The allocation was complicated by boundaries in the machine architecture -
references (variable-value slots) had to be in the 1st 32k words, and
program within the 1st 64k. In effect, this was a partial revival of the
"zoo" architecture.

Interrupt handling in Multipop was conventionally coded in assembly
language. The interface to the operating system used function-closures to
provide security. For example, a disc-file was presented as a closure whose
data-part was a buffer, and whose function part was the code to extract
characters from the buffer.

Robin Popplestone.

Robin Popplestone

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to

Hans-J. Boehm (bo...@parc.xerox.com) writes in
<4044qp$b...@news.parc.xerox.com>

> I think this all depends on so many factors (e.g allocation algorithms,
> presence of VM, scarcity of physical memory) that it is very hard to
> generalize. My Cedar environment stays up for at least weeks with a purely
> noncompacting collector. Though it runs as a group of SunOS processes, it
> also has thread switcher and many client threads (mailers, editors, etc.)
> running in one address space. I don't notice any performance deterioration.

One crude measure of the effect of fragmentation is:

size of store
-------------
size of biggest stored object

This was about 500 for Multipop, assuming we stored visual data for the robot
by separate rows. For Cedar, I imagine it could be over a hundred thousand,
though you don't tell us how, for example, the editors store their text.
However, those who speak of megabyte single objects are certainly speaking of
systems that could get seriously fragmented.

> But the heap rarely has more than 25 MB or so live data, and recently I've
> always had at least 64MB of physical memory.

Multipop often ran close to having no free memory whatever, at which point
users who exceeded their memory quota would be thrown off. Thus the compactor
had to realise all the available store so that we could use almost all of
it in some applications.

> I suspect that the right kind of compaction is occasionally useful. But
> you can go a long way without it.

In a modern environment, compaction is more a matter of efficiency, determined
by interactions between the garbage collector and the various memory systems.
However systems that can do compaction (or in general relocation of objects)
do possess degrees of freedom that systems that cannot relocate objects
do not possess. This freedom offers potential for enhancing performance.


Robin Popplestone.

Patrick Herring

unread,
Aug 7, 1995, 3:00:00 AM8/7/95
to
wil...@cs.utexas.edu writes in article <400hpm$3...@jive.cs.utexas.edu>:

IDMS was developed by John Cullinane in the early 70s (I think) from the
CODASYL network design. It got a reputation for speed, particularly with
American banks. I don't know when the online side was added but is it
definitely of that era (eg <enter> often does double duty as 'do command'
or 'page forward'). I do know that the online environment (IDMS-DC) uses the
same methods for controlling resources as the DB side (IDMS-DB) uses for
controlling data. The DB side is fast because it minimises physical io by
maximising buffer use and physical efficiency. I would say that the continued
use of IDMS for several decades validates its design! Also the online
environment can stay up for weeks without compromising itself so it must be
doing something right. Since the usual thing is to close the system down once a
day for backups I'm not aware of memory-fragmentation problems. Disk defragging
takes place once in a blue moon since there's always more than enough space
allocated (disk space is cheap). Another point is that DP algorithms are nearly
always O(c) in space since the possibility of having to process a million
records makes doing anything else unreliable; so having a lot of destructive
assignments tends to do away with the need for run-time allocation and garbage
collection, indeed the main way in IDMS-DC of getting more storage is
programmer-controlled and is usually configured by the DBAs to use memory up to
a limit and then go to a special database area (which, being single-threaded
across all tasks is rather expensive) so greedy programs hurt themselves not
the other users.

There is a book on IDMS but no details to hand (it tends to be the kind you see
in remainder sales unfortunately). Date's book (or is it Codd) on DBs has a
small chapter on IDMS as an example of CODASYL network DBs but since he clearly
sees it as old technology that he's sweeping away with his relational stuff he
doesn't say much. Cullinet was bought by CA 10 years ago.

A brief description of IDMS-DB:

A database is a collection of areas. An area is a contiguous range of pages. A
page is the same as a disk block (to get the fastest possible disk io) and has
internal structure of page header, data area, page footer. A page footer has a
backwards-growing list of the page-offsets and lengths of all records stored on
the page. Access is either 'CALC' or 'VIA'. 'CALC' is by hashing a key (one or
more elements of the record) to a page number (the DBMS then walks a pointer
chain in the page to find the right one - the DBAs should make the area size
large enough to avoid duplicated hash hits). Pointers are 32-bit and are
page-number:footer-list-entry (how many bytes for each is configurable at
design time). 'VIA' is by walking pointers (aka 'SETS') starting with the owner
record (which is usually a calc record but doesn't have to be) - these pointers
should be to records in the same page if the DBAs are awake. You can have more
than one record type stored via a calced type. If you think of a DEPT record
stored 'CALC' with EMPLOYEE records stored 'VIA' it, and area and page sizes
being tuned to the data volumes then you can see that with one physical disk
block read you get a page of logically related data at one go, ie everything
needed for one screen or unit of work. This is also useful for locking since it
often means each process is interested in only one buffer page. 'Logically
related' is a matter of design and frequently you can't have all possible
logical relations implemented physically - you have to decide which are the
most common ones and let the rarely used ones be slower to access. This works
well in practice. Other structures: automatically sorted sets, binary tree
indexes. You can't have many-to-many sets because each record holds at most
three pointers for any one set - next, prior and owner - you set up a junction
record type to act as the owner one way and the set members the other way.
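
A very rough sketch of that page layout (field names, widths and the page
size are illustrative guesses, not CA-IDMS internals):

struct LineEntry {            // one per record stored on the page
    unsigned short offset;    // where the record starts within the page
    unsigned short length;    // how long it is
};

struct Page {                 // one page == one disk block
    unsigned long  page_number;      // page header
    unsigned short free_space;
    char           body[4096 - 8];   // records are placed here, while an
                                     // array of LineEntry grows backward
                                     // from the end of the page; a pointer
                                     // (page-number : entry-index) then
                                     // locates a record without scanning.
};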

I think the general point here is that old-style mainframe technology makes
great use of knowing the fixed limits of things. Eg in DB2 (under SQL) if you
want another data access path you just add indexes. This gives the impression
of great flexibility but the cost in extra io is huge (in relative terms - an
extra io to write when storing and two ios to read). Managers prefer a system
they feel they can change at a moment's notice but is that really what's needed
or is it just a stress-reduction exercise? Or, in the context of malloc and
GCs, are we really doing ourselves a favour by struggling to come up with ways
of coping with increasing run-time flexibility when that just soaks up cpu etc
and when the user still has a specific rather than general job in mind. On the
other hand people (including myself) hate having to give estimates for data
volumes etc when it doesn't seem logically necessary - we'd like to be able to
say 'no estimates and I'll take the consequences' or 'no more than 1000
widgets-worth and use that information well'. The arguments against this or
that malloc strategy always seem to be that you don't know this or that
heuristic at the right time. I don't think it's worth trying to make
guesstimates work as well as real information - better to make systems whose
behaviour clearly benefits from better information - then you motivate
people to make their minds up as to what they actually want and give the
program useful guidance. 'Configurability not apparent generality' should be
the slogan. Just my 2p <g>.

Stefan Monnier

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In article <hbaker-0708...@192.0.2.1>,
Henry Baker <hba...@netcom.com> wrote:
] Because for most small strings, your O() arguments are completely washed
] out by constant factors. Most of the concatenation I have seen done
] in C programs is done by either sprintf or strcat, and these strings are
] typically quite small. Therefore, the allocation/deallocation overhead
] swamps the differences in complexity that you mention.

By the way, is there any paper available discussing the distribution of
string lengths in some programs?


Stefan

PS: I feel like there are really (at least) two different kinds of strings,
depending on their use. I don't consider a "token" string from a lexer the
same as the text data of a text widget.

Scott Wheeler

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In Article <hbaker-0708...@192.0.2.1> Henry Baker writes:
>In article <jyw...@bmtech.demon.co.uk>, Scott Wheeler
><sco...@bmtech.demon.co.uk> wrote:
>
>> In Article <hbaker-0408...@192.0.2.1> Henry Baker writes:
>> >The Stroustrup/Coplien school advocates using 2 mallocated objects in
>> >C++ when one will do in C, and this thread already described many of
>> >the horrors of malloc inefficiencies.
>>
>> Only one allocation of heap memory (not *malloc*, please - that's C)
>> unless you are using "new" to generate the string, because the top
>> level of the object will go on the stack. Since temporaries and
>> automatic variables go on the stack, this makes a substantial
>> difference.
>
>Excuse me?? You should read the Stroustrup/Coplien books again. They
>suggest newing/mallocating everything, & putting ref counts in the string
>'headers'.

The ARM does have an incomplete reference-counted string class (I don't
have Coplien to hand, or Stroustrup's later books). Of course, this
means one would expect the number of "new"s per stack-based string
object created in the program to be between 0 and 2, dependent on the
program (new, *not* malloc - it simply isn't interchangeable with new,
and neither S nor C recommend it here).

>Your scheme could conceivably work, if one were willing to
>copy around both the ptr to the string chars _and_ the length field,
>but such an implementation would make a string 'handle' bigger than a
>usual C++ pointer, and thereby cause a number of other problems with
>the C++ type system.

If you are talking about the data representation

class str {
int nLen; // string length
char *pachData; // ptr to character array containing data
};

it's hardly "my" scheme (I use the draft-ANSI string class where
available), it's perfectly legal C++, doesn't subvert the type system,
and it works just fine. There is no reason why it should be of the same
size as a pointer - I'm not proposing some odd cast. You do realise
that there are the usual set of member functions? I omitted them
because they weren't relevant to the present discussion.

>Furthermore, one would _not_ want to copy any ref counts
>with such a scheme. In any case, I haven't seen this particular
>variant recommended in any of the usual books.

No, this is not a reference counted implementation - I thought I made
it clear that nLen is a length count (hence no need to use strlen()
etc.). BTW, I came across an article a couple of years back (probably
CUJ) giving a case study of developing and profiling a string class.
The author passed through this version to a reference counted one, and
found that his performance decreased. Probably strongly dependent on
the application, but it does indicate that a non-reference-counted
version is not necessarily naive.

>> >So the point of the posting was that the Stroustrup/Coplien style
>> >_guarantees_ bad performance,
>>
>> I've yet to see any hard data posted. The 3x quoted repeatedly here has
>> not yet been shown to be more than a guess.
>
>The double allocator hit would be 2X by itself, all other things being
>equal. If the potentially greater fragmentation of the free list
>caused a lengthening of freelist searches, then this would show up in
>further reduced allocator and deallocator performance.

*If* you do a double allocation. The whole point of using a reference
counted version with copy-on-write is that you are guessing that in
most cases you are copying an existing string - hence you only need to
create a new "header" (on the stack) and increment the count in the
"body" (srep struct in the ARM example) - no heap allocation, and
follow one pointer.

And *if* the two allocations are equally costly, i.e. both or neither
cause a page fault.
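
A minimal sketch of that copy-on-write arrangement (not the ARM's actual
code; constructors from char*, assignment, etc. are omitted):

#include <string.h>

struct SRep {                 // the shared "body", heap allocated
    int   refs;
    int   len;
    char* chars;
};

class CowString {             // the "header", usually on the stack
    SRep* rep;
public:
    CowString(const CowString& s) : rep(s.rep) { ++rep->refs; }  // no heap work
    ~CowString()
    {
        if (--rep->refs == 0) { delete [] rep->chars; delete rep; }
    }
    char get(int i) const { return rep->chars[i]; }   // reads never copy
    void set(int i, char c)                           // writes copy first
    {
        if (rep->refs > 1) {              // body is shared: split off a copy
            SRep* r = new SRep;
            r->refs = 1;
            r->len = rep->len;
            r->chars = new char[r->len];
            memcpy(r->chars, rep->chars, r->len);
            --rep->refs;
            rep = r;
        }
        rep->chars[i] = c;
    }
};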

>Accessing the string itself is at least 2X slower, due to the double
>indirection.

The second point above holds here.

>Yes, my 3x guess is just a guess, but a well-educated guess based on
>lots & lots of experience.

Why guess when you can measure? My guess, founded on experience
profiling Noah's project planning software (the product was successful,
but my company went under), is that you will find it very difficult to
measure a systematic difference, and that if you *can* prove anything,
it will be strongly dependent on the application area.

Scott

Scott Wheeler

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In Article <405k8h$e...@news.parc.xerox.com> Hans Boehm writes:
>It still seems to me that we arguing essentially about the best way to
>implement the wrong interface.
>
>We seem to agree that most strings are (or should be) constant. But
>presumably they are not explicit constants in the program source. So
>they are computed somehow and then not modified. Typically they are
>computed as the concatenation of other strings, as the substring of
>another string, or by some combination of these, starting from program
>constants, integer-to-string conversions, etc.
>
>This argues that we should be concerned with:
>
>1. Immutable strings.
>2. Concatenation time and space requirements.
>3. Substring time and space requirements.
>4. String access time requirements.
>...

>So we're ignoring a potentially exponential complexity difference, and
>we're concentrating on the factor of 2 in the asymptotically inferior
>solution? Why?

A good point. Speaking only for myself, I usually work with programs
where the number of live strings at any given time is low (hence the
allocator locality problem is unlikely to cause thrashing). When
thinking of an example where a large number of strings would be live at
a given time, the symbol table of a compiler was the most obvious
example, and the compilers I've seen have done little or no alteration to
symbols after lexing and parsing. I'm not sure what string operations
would dominate in lexing - possibly search for substrings?

I'm afraid the tree implementation is new to me, and I'll try to read
up on it. Presumably if the tree representation has an O(n^2) or O(ln
n) advantage over the doubly indirect method under discussion, there is
also a constant factor to consider, as I assume the logic to implement
the tree is more complex. At roughly how many string operations would
the tree method be expected to be superior?

Scott


Henry Baker

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In article <4059v8$8...@kernighan.cs.umass.edu>, p...@cs.umass.edu ( Robin
Popplestone ) wrote:

[snip of lots of really interesting tidbits]

I _sincerely_ hope that this stuff is written down somewhere. It's such
a waste of brainpower to recreate these things every generation. This
mode makes each generation stand on its predecessor's toes instead of
its predecessor's shoulders.

Also, like the teenager who discovers that his father learned one heck
of a lot between the time the teenager was 15 and the time he was 21, many of us
would be interested in some of the more subtle details of what our
predecessors did, because there may be additional wisdom there that
wasn't evident the first or second time we thought about the problem.

Henry Baker

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In article <kelvin.807660304@kickapoo>, kel...@cs.iastate.edu (Kelvin
Nilsen) wrote:

> p...@cs.umass.edu ( Robin Popplestone ) writes:
>
> >> In article <3vnghm$s...@info.epfl.ch>,
> >> Stefan Monnier <stefan....@epfl.ch> wrote:
> >>In article <hbaker-3107...@192.0.2.1>,
> >>Henry Baker <hba...@netcom.com> wrote:
> >>] IMHO the major gain from compaction is to defragment the address space,
> >>] but as Paul's recent work has shown, one may not have to do this very
> >>] often.
>
> >And Paul Wilson writes in <3vnvr1$9...@jive.cs.utexas.edu>
>
> >> Right. For "most" programs (that have been studied in enough detail, that
> >> is), it looks like you never have to do it at all. Nobody's studied
> >> really long-running programs, though, and it's not clear whether
> >> fragmentation will accumulate over time, or stabilize at an acceptable
> >> level.

I agree with Kelvin about not giving up on compaction completely, and disagree
with Paul that you _never_ have to do it.

While a significant fraction of the best brains in the computer business over
the past 30 years have been aimed at optimizing architectures so that objects
don't have to move (and thereby inadvertently penalizing any programs that
actually _do_ move objects), they haven't completely succeeded at beating
Robson and/or the second law of thermodynamics (in the sense I used it
in my Sigplan Notices paper). Therefore, up until the point at which
your address space is so incredibly trashed that you can't allocate anything
of any reasonable size any more, your program will actually run pretty
well, because the memory hierarchy stuff (VM, cache) are doing all of the
dynamic moving of data for you, without your conscious control, and indeed
without much ability on your part to affect it much even if you wanted to.

The problem, like that of many GC systems, is that 'reorganizing' address
space is a relatively massive and expensive undertaking, and will not
happen very fast. Furthermore, it is the kind of process that memory
hierarchies hate because their locality assumptions fail completely in
the face of such reorganization.

Therefore, unless one can somehow organize this process so that it takes
a relatively small percentage of the system, and doesn't interfere too
much with the normal operation of the program, the cure of reorganization
may be worse than the disease of fragmentation.

However, neither the disease nor the cure need be fatal, but they do have
to be approached with a great deal of subtlety.

If you have enough memory, you can either incrementally copy the stuff
to another processor, and let it do the reorganization for you (easier
in this case than in the reachability GC case), or have the same processor
do a kind of background reorganization task. It is important, IMHO, that
the background task _know_ it is a background task, and be able to inform
the memory hierarchy of its second-class status, so that its activities
will not hurt the more important activities of the application program.

Incremental copying GC's try to do these things in this way (within the
constraints of the HW & OS). The only slight variation is that with the
way the current generation of HW architectures are optimized, it may not
be optimum to copy _all_ the time, but only once in a while.

Scott Wheeler

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
In Article <807820...@wildcard.demon.co.uk> Cyber Surfer writes:
>> See PJ Plauger "The draft ANSI C++ library". This part of the draft can
>> be expected not to change significantly. It's implemented in the
>> Borland compiler.
>
>Is that a book?

Yes, a very useful reference, comments on motivation, and an
implementation ("The ANSI C library", ibid, is also highly
recommended). It doesn't cover the Standard Template Library, a recent
addition to the draft, but since I don't know of a PC compiler with the
extensions to the template functionality required to support the STL,
this isn't a major loss.

> If so, do you have the ISBN?

At home - I'll bring it in tomorrow if no-one answers first.

>Any idea when MS or Symantec will implement it? Sadly, I can't choose
>which compiler I use, at the moment, and Borland isn't on the list.

I think it is possible to licence Plauger's implementation for
commercial use (in source form in the book, and available on disk),
which would be ideal providing MS doesn't choke on the syntax. Other
than that, I'd suggest implementing the parts of the string class that
you are actually using, which shouldn't be too time-consuming, and
extend it to the full spec as needed.

One minor point (probably only of interest if you program for the
Pacific Rim): in the first draft, strings were defined as a normal
class. They are now supposed to be defined as a class template, so that
strings can be produced for wide and long characters. From memory,
PJP's implementation is for the original spec, but it is trivial for
this to become a specialised implementation of the general template. In
any case, the interface for the programmer hasn't changed
significantly.

Scott

Matthew Kennel

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
Henry Baker (hba...@netcom.com) wrote:

: The double allocator hit would be 2X by itself, all other things being
: equal. If the potentially greater fragmentation of the free list caused
: a lengthening of freelist searches, then this would show up in further reduced
: allocator and deallocator performance.

: Accessing the string itself is at least 2X slower, due to the double
: indirection.

: As Paul Wilson pointed out, the 'rounding up' of each object to at least
: 16 bytes could cost severely in memory hierarchy costs (caching, VM, etc.),
: and such costs would be most evident in the case where most strings were
: short (i.e., <16 bytes, the most usual case).

: Yes, my 3x guess is just a guess, but a well-educated guess based on
: lots & lots of experience.

This issue also comes up in numerical matrix libraries. It's complicated
by the fact that there is also the issue of 2 dimensional indexing.

The questions are:

*) Should matrix data be stored through an extra indirection?
E.g. have a header and then an internal pointer to the data
itself?

*) Should columns vs rows be accessed through "dope vectors"
i.e. numerical recipes in C style allocation?

For speed, the answers are both "no". But such a solution is difficult
to achieve in C++.
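
The two layouts in question, sketched with invented names:

// (a) "dope vector" / Numerical-Recipes-in-C style: an array of row
//     pointers, each row its own allocation; m[i][j] costs two loads.
double** make_dope_matrix(int rows, int cols)
{
    double** m = new double*[rows];
    for (int i = 0; i < rows; i++)
        m[i] = new double[cols];
    return m;
}

// (b) one contiguous block that carries its own bounds; m(i, j) is a
//     multiply-add plus a single load, and the data is cache-contiguous.
class Matrix {
    int     rows, cols;
    double* data;
public:
    Matrix(int r, int c) : rows(r), cols(c), data(new double[r * c]) {}
    ~Matrix() { delete [] data; }
    double& operator()(int i, int j) { return data[i * cols + j]; }
};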

Given the first issue (indirect pointer to data) there are optimization
difficulties in trying to get it automatically elided out.

I believe these were discussed in a recent article in _Computers in Physics_,
whose empirical results demonstrated a typical 3x-4x speed reduction
compared to direct-access Fortran style, just as Mr Baker is predicting.

Some of it could be recovered by very smart optimizations but the language
definition and aliasing problems make this difficult.

<plug> I should point out that the standard Sather implementation does things
the "obvious" way advocated by Mr Baker, and puts the size fields of strings
and matrices along with the variable-sized data in one contiguous allocation
unit.

We hope to be a factor of "1" compared to idiomatic C on matrix indexing
(but with dynamically sized matrices who carry their bounds with them) with
the upcoming 1.1 release. A test doing manual source transformations of the
optimizations intended to be automatic in that release resulted in identical
assembly code being generated for the naive matrix multiply loops for Sather
and native C (compiling using the GNU C compiler). AIX Fortran generated
code that was about 10% faster.

There will also be Sather matrix classes which call the BLAS libraries for
their underlying computation. </plug>

: -- T
: www/ftp directory:
: ftp://ftp.netcom.com/pub/hb/hbaker/home.html

Thomas Wang

unread,
Aug 8, 1995, 3:00:00 AM8/8/95
to
Henry Baker (hba...@netcom.com) wrote:

> Therefore, up until the point at which
> your address space is so incredibly trashed that you can't allocate anything
> of any reasonable size any more, your program will actually run pretty
> well, because the memory hierarchy stuff (VM, cache) are doing all of the
> dynamic moving of data for you, without your conscious control, and indeed
> without much ability on your part to affect it much even if you wanted to.

Precisely. Doing no relocation/compaction is likely to give the best performance
for this reason. Pathological fragmentation is still a problem, which needs
to be addressed. Interestingly enough, present day VM hardware is already
sufficient to minimize the fragmentation problem through NOT MAPPING the
free fragments. So the true problem lies in OS software, which provides no
API to portably access this feature. What needs to be done is for
researchers to present a standard package of APIs to various OS standards
bodies to ensure wide adoption of various VM access functions.

In one sample implementation, I imagine a non-moving incremental garbage
collector with a software write barrier. This collector uses a binary buddy
system memory allocator, which minimizes external fragmentation. Internal
fragmentation is minimized by having the VM hardware leave the free
fragments unmapped. Sub-page fragmentation is minimized by grouping small objects
into pools. This collector would have good interactive performance, and
should be cache friendly as well.
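
The binary-buddy part is worth a sketch, because the key operation is so
cheap (names invented for illustration):

#include <stddef.h>

// Block sizes are powers of two and blocks are size-aligned within the
// heap, so a block's "buddy" is found by flipping one address bit.
inline char* buddy_of(char* heap_base, char* block, size_t size)
{
    size_t off = (size_t) (block - heap_base);
    return heap_base + (off ^ size);
}

// On free: if the same-sized buddy is also free, merge the two into one
// block of twice the size and repeat a level up.  Whole pages inside
// large free blocks can then be left unmapped (or discarded through the
// VM API argued for above) instead of being paged in and out as garbage.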

Regards.

> www/ftp directory:
> ftp://ftp.netcom.com/pub/hb/hbaker/home.html

-Thomas Wang (Software Engineer, Enterprise Objects Program)
wa...@cup.hp.com http://hpeopose.cup.hp.com/~wang/wang.html


Scott Wheeler

unread,
Aug 9, 1995, 3:00:00 AM8/9/95
to
In Article <807820...@wildcard.demon.co.uk> Cyber Surfer writes:
>> See PJ Plauger "The draft ANSI C++ library". This part of the draft can
>> be expected not to change significantly. It's implemented in the
>> Borland compiler.
>
>Is that a book? If so, do you have the ISBN?

"The Draft Standard C++ Library", PJ Plauger, Prentice Hall,
0-13-117003-1

Scott

James Kanze US/ESC 60/3/141 #40763

unread,
Aug 9, 1995, 3:00:00 AM8/9/95
to
In article <3vr85r$7...@fido.asd.sgi.com> mik...@engr.sgi.com (Mike
McDonald) writes:

|> In article <jyv...@bmtech.demon.co.uk>, Scott Wheeler <sco...@bmtech.demon.co.uk> writes:


|> |> class str {
|> |> int iCount;
|> |> char achData[256];
|> |> };
|> |>
|> |> with the obvious problems in terms of fixed maximum length of the
|> |> string.

|> Yuch! In C, (I don't know how to do it in C++) you'd do something like:

|> typedef struct string {
|> int length;
|> char contents[0];
|> } *String;

Not legal, at least not in C (or C++). Arrays may not have length of
0.

|> String make_string(int n)
|> {
|> String s;

|> s = (String) calloc(1, sizeof(struct string) + n);
|> if (s == NULL)
|> you_lose();
|> s->length = n;
|> return (s);
|> }


|> (If your C compiler doesn't like arrays of length zero, declare contents as
|> length 1 and subtract 1 from the sizeof(struct string).)

But you still cannot *access* the remaining memory. At least not
easily. Accessing contents with an index > 0 is undefined behavior;
with any implementation of reasonable quality, it will cause a bounds
check error. (Of course, I've never seen an implementation of C with
reasonable quality, so you may be safe. Although I seem to have heard
somewhere that there is one, Centerline, maybe?)

Interestingly enough, there is a legal (and safe) way of implementing
this same idiom in C++. It's so ugly, though, that I'm not going to
post it. (I know, if you were worried about ugliness, you wouldn't be
using C++ anyway:-).)

|> This way, both the length and then the data are allocated together in the heap. And
|> you haven't limited the max length artificially. (I suspect most BASIC systems do
|> this.)

I once ran benchmarks (with a very simple program, so probably not
indicative of anything) of three reference counted implementations of
C++ strings: the classical one, the classical one with operator new
overloaded for the fixed length header class, and one using the above
ugly hack (and so only one allocation per string). Under Solaris,
using the standard system malloc (called directly by new), there was
no significant difference in runtime.

--
James Kanze Tel.: (+33) 88 14 49 00 email: ka...@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
-- Beratung in industrieller Datenverarbeitung

Henry Baker

unread,
Aug 9, 1995, 3:00:00 AM8/9/95
to
In article <KANZE.95A...@slsvhdt.lts.sel.alcatel.de>,

ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

> Interestingly enough, there is a legal (and safe) way of implementing
> this same idiom in C++. It's so ugly, though, that I'm not going to
> post it. (I know, if you were worried about ugliness, you wouldn't be
> using C++ anyway:-).)

Go ahead & post it! Hit us with your ugly stick! We're adults, and
we've turned off the V-chip...

> |> This way, both the length and then the data are allocated together in the heap. And
> |> you haven't limited the max length artificially. (I suspect most BASIC systems do
> |> this.)
>
> I once ran benchmarks (with a very simple program, so probably not
> indicative of anythinkg) of three reference counted implementations of
> C++ strings: the classical one, the classical one with operator new
> overloaded for the fixed length header class, and one using the above
> ugly hack (and so only one allocation per string). Under Solaris,
> using the standard system malloc (called directly by new), there was
> no significant difference in runtime.

We may be interested in these results after we've seen the program.

Hans Boehm

unread,
Aug 9, 1995, 3:00:00 AM8/9/95
to
hba...@netcom.com (Henry Baker) writes:

>In article <405k8h$e...@news.parc.xerox.com>, bo...@parc.xerox.com (Hans
>Boehm) wrote:

>> hba...@netcom.com (Henry Baker) writes:
>> So we're ignoring a potentially exponential complexity difference, and
>> we're concentrating on the factor of 2 in the asymptotically inferior
>> solution? Why?

>Because for most small strings, your O() arguments are completely washed
>out by constant factors. Most of the concatenation I have seen done
>in C programs is done by either sprintf or strcat, and these strings are
>typically quite small. Therefore, the allocation/deallocation overhead
>swamps the differences in complexity that you mention.

The breakeven points for string concatenation are actually very small.
My measurements suggest it's at about 10 characters. You do lose a
small constant factor (on the order of 2 with a fast implementation,
a bit more if you use the simplest C interface) in traversal time.
(See Boehm, Atkinson, Plass, "Ropes Are Better Than Strings",
PARC CSL-94-10 for details. A revised version should appear in
SP&E shortly.)

For most programs I would greatly prefer a version that's a little
slower on typical input, but uses asymptotically good algorithms.
Such a program is much more robust. I can expect it to keep working
reasonably if I feed it input that wasn't used for testing by the
program designer.

The program that's tuned for short strings with asymptotically
suboptimal algorithms will usually work fine. But then it may start
taking hours instead of seconds when I increase the size of some input
that should have minimal effect on its running time. For example,
I recently spent several hours tracing dismal mouse selection
response under rxvt to the size of the history buffer. Apparently,
it was meant to be "small", and I set it at 5000 lines, like I always
do. I haven't looked at the code, so I still don't understand why
that should matter.

>There are programs that waste a lot of time doing string manipulation. I
>would imagine that most of the Perl-like stuff is quite non-optimal. If
>people aren't using better representations for these kinds of strings,
>it's probably because implementing the algorithm as a string-processing
>algorithm is not such a hot idea in the first place. In other words,
>someone would be better off changing the algorithm completely, rather
>than worrying about some non-optimal string representation.

I don't think you should use string manipulation where it's inappropriate.
But many programs (editors, mailers, news readers, etc.) naturally
manipulate long strings. Currently they mostly invent their own custom
representations, except where the authors decided not to bother. In those
places they effectively break on unexpectedly long input. Why not use
a robust representation as the standard, and then optimize in those
few places where it doesn't perform adequately, and it's safe to
optimize?

Cyber Surfer

unread,
Aug 9, 1995, 3:00:00 AM8/9/95
to
In article <jyw...@bmtech.demon.co.uk>
sco...@bmtech.demon.co.uk "Scott Wheeler" writes:

> >Is that a book?
>
> Yes, a very useful reference, comments on motivation, and an
> implementation ("The ANSI C library", ibid, is also highly
> recommended). It doesn't cover the Standard Template Library, a recent
> addition to the draft, but since I don't know of a PC compiler with the
> extensions to the template functionality required to support the STL,
> this isn't a major loss.

Yes, the lack of such a compiler is a major pain. It's made worse
by a lack of _choice_ of compiler, but that's another matter. I can
at least choose which books I read. ;-)



> > If so, do you have the ISBN?
>

> At home - I'll bring it in tomorrow if no-one answers first.

Excellent. Thanks.



> >Any idea when MS or Symantec will implement it? Sadly, I can't choose
> >which compiler I use, at the moment, and Borland isn't on the list.
>
> I think it is possible to licence Plauger's implementation for
> commercial use (in source form in the book, and available on disk),
> which would be ideal providing MS doesn't choke on the syntax. Other
> than that, I'd suggest implementing the parts of the string class that
> you are actually using, which shouldn't be too time-consuming, and
> extend it to the full spec as needed.

Well, I'll consider that. It'll depend on how much time it takes,
of course. Not everything that is possible is also practical, but
it's worth a try.



> One minor point (probably only of interest if you program for the
> Pacific Rim): in the first draft, strings were defined as a normal
> class. They are now supposed to be defined as a class template, so that
> strings can be produced for wide and long characters. From memory,
> PJP's implementation is for the original spec, but it is trivial for
> this to become a specialised implementation of the general template. In
> any case, the interface for the programmer hasn't changed
> significantly.

Thanks.

Bruce Hoult

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
Scott Wheeler <sco...@bmtech.demon.co.uk> writes:
> Yes, a very useful reference, comments on motivation, and an
> implementation ("The ANSI C library", ibid, is also highly
> recommended). It doesn't cover the Standard Template Library, a recent
> addition to the draft, but since I don't know of a PC compiler with the
> extensions to the template functionality required to support the STL,
> this isn't a major loss.

On the Macintosh, the latest versions of both Symantec C++ and Metrowerks
CodeWarrior have come with the STL since about May.

-- Bruce

Brian N. Miller

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
|hba...@netcom.com (Henry Baker) writes:
|
|> This is the point of the thread, but I was simply pointing out that
|> Stroustrup/Coplien acknowledge that their school has bad allocator
|> performance, and recommend that users write their own mallocator.
|>
|> So you have now gone on record as disagreeing with Stroustrup & Coplien.

I'm generally a fan of Stroustrup, but his reference counting
allocation proposal sucks:

1) It expects application programmers to flawlessly obey the reference
counting protocol at the source code level. What a pain in the ass.
There's just too much opportunity for mistakes.

2) It's not well suited for concurrency. Garbage collection can be run
as a background thread. Reference count triggered deallocation is
not so readily parallelized.

3) It lacks compaction. This is a big miss for some applications, and
so I question its general applicability.

4) Recommending users write their own allocator in desperation is just
silly. Bootstrapping the heap is one thing a language's audience
should almost never have to do.

James Kanze US/ESC 60/3/141 #40763

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <hbaker-0908...@192.0.2.1> hba...@netcom.com
(Henry Baker) writes:

|> In article <KANZE.95A...@slsvhdt.lts.sel.alcatel.de>,
|> ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

|> > Interestingly enough, there is a legal (and safe) way of implementing
|> > this same idiom in C++. It's so ugly, though, that I'm not going to
|> > post it. (I know, if you were worried about ugliness, you wouldn't be
|> > using C++ anyway:-).)

|> Go ahead & post it! Hit us with your ugly stick! We're adults, and
|> we've turned off the V-chip...

Just the essentials (and ignoring eventual reference counting):

#include <assert.h>     //  for assert
#include <stddef.h>     //  for size_t

class StringImpl
{
public :
    static StringImpl*  mkStringImpl( int length ) ;
    char*               buffer() ;

    void*               operator new( size_t s , int length ) ;
    void                operator delete( void* ) ;
private :
                        StringImpl( int length ) ;
    int                 theLength ;
} ;

StringImpl*
StringImpl::mkStringImpl( int length )
{
    return new( length ) StringImpl( length ) ;
}

char*
StringImpl::buffer()
{
    //  the characters live immediately after the header object
    return (char*)( this + 1 ) ;
}

void*
StringImpl::operator new( size_t s , int length )
{
    assert( s == sizeof( StringImpl ) ) ;
    return ::operator new( s + length ) ;
}

void
StringImpl::operator delete( void* p )
{
    ::operator delete( p ) ;
}

StringImpl::StringImpl( int length )
    :   theLength( length )
{
}

The purpose of the private constructor (and the static mkStringImpl)
is to ensure that 1) all elements *are* created on the heap, and 2)
the extra parameter to new and the parameter to the constructor are
identical. (Safety, in sum.)

In my tests, all of the functions were made inline.
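
Used from the outside, the idiom looks something like this (an
illustration only -- not the benchmark code, which I no longer have; it
assumes the StringImpl class above is in scope):

#include <string.h>

void
demo()
{
    StringImpl*     s = StringImpl::mkStringImpl( 6 ) ;
    strcpy( s->buffer() , "hello" ) ;   //  data lives just past the header
    //  ... use s ...
    delete s ;                          //  one block, one deallocation
}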

|> > |> This way, both the length and the data are allocated together in
|> > |> the heap. And you haven't limited the max length artificially.
|> > |> (I suspect most BASIC systems do this.)
|> >
|> > I once ran benchmarks (with a very simple program, so probably not
|> > indicative of anything) of three reference counted implementations of
|> > C++ strings: the classical one, the classical one with operator new
|> > overloaded for the fixed length header class, and one using the above
|> > ugly hack (and so only one allocation per string). Under Solaris,
|> > using the standard system malloc (called directly by new), there was
|> > no significant difference in runtime.

|> We may be interested in these results after we've seen the program.

I don't have the benchmark anymore, and forget exactly what I tested
(in addition to just making the strings). I do remember being
surprised by the results, since the malloc under SunOS 4 is not
particularly efficient when a lot of small blocks are being allocated.

Henry Baker

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <KANZE.95A...@slsvhdt.lts.sel.alcatel.de>,
ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

> In article <hbaker-0908...@192.0.2.1> hba...@netcom.com
> (Henry Baker) writes:
>
> |> In article <KANZE.95A...@slsvhdt.lts.sel.alcatel.de>,
> |> ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:
>
> |> > Interestingly enough, there is a legal (and safe) way of implementing
> |> > this same idiom in C++. It's so ugly, though, that I'm not going to
> |> > post it. (I know, if you were worried about ugliness, you wouldn't be
> |> > using C++ anyway:-).)
>
> |> Go ahead & post it! Hit us with your ugly stick! We're adults, and
> |> we've turned off the V-chip...
>
> Just the essentials (and ignoring eventual reference counting):
>
> class StringImpl
> {
> public :
> static StringImpl* mkStringImpl( int length ) ;
> char* buffer() ;

// Gain access to the actual string buffer.

> void* operator new( size_t s , int length ) ;

// Overlard misplacement 'new'.

> void operator delete( void* ) ;
> private :
> StringImpl( int length ) ;

// private ctor means no static and/or stack instances; being 'private' means
// that derived classes can't create any, either. No default ctor means no
// arrays or members. No friends declared. _Do_ have default assignment,
// so it isn't as safe as you might think.

> int theLength ;
> } ;
>
> StringImpl*
> StringImpl::mkStringImpl( int length )
> {
> return new( length ) StringImpl( length ) ;
> }
>
> char*
> StringImpl::buffer()
> {
> return (char*)( this + 1 ) ;

// This may not work so well for arrays of doubles due to alignment problems.

> }
>
> void*
> StringImpl::operator new( size_t s , int length )
> {
> assert( s == sizeof( StringImpl ) ) ;
> return ::operator new( s + length ) ;
> }
> void
> StringImpl::operator delete( void* p )
> {
> ::operator delete( p ) ;
> }
>
> StringImpl::StringImpl( int length )
> : theLength( length )
> {
> }
>
> The purpose of the private constructor (and the static mkStringImpl)
> is to ensure that 1) all elements *are* created on the heap, and 2)
> the extra parameter to new and the parameter to the constructor are
> identical. (Safety, in sum.)
>
> In my tests, all of the functions were made inline.

You're right, it's ugly. It's also no safer than something using explicit
casting. For some strange reason :-), it _is_ efficient (at least this part,
w/o reference counting), since all of this stuff can be inlined. The mere fact
that compiler writers have been so accommodating in this kind of thing is proof
positive that people have to do this kind of end-run around the type system
a lot. (Or at least people at ATT have to do this a lot!)

But I think that most people would agree that you've subverted the C++ type
system to achieve your end, so 'legal' here would mean a language lawyer's
'legal', rather than 'ethical'.

Paul Johnson

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
Paul Wilson (wil...@cs.utexas.edu) wrote:

> To be more concrete, I think that phase behavior is quite important to
> locality. Objects created at *roughly* the same time are likely to be
> used for similar purposes, but maybe not if they are of different *types*.
> If they are of different types, they are likely to be used for different
> purposes---e.g., some may hold intermediate values of a computation, while
> others hold results that outlive the phase that creates them.

> Assuming we don't know the types of objects being allocated, as conventional
> allocators don't, it's reasonable to use size heuristically to discriminate
> between different types of objects created at about the same time.

How about looking at the call stack? You can identify the allocate
call by looking at your return address. Then localise objects
according to that. You could extend this scheme by hashing the last
few return addresses together to get some idea of the context.
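
Something like the following sketch, say. A GCC-style
__builtin_return_address is assumed, the names and table size are
invented, and walking above level 0 isn't reliable on every platform,
so only the immediate caller is hashed here:

#include <stdlib.h>
#include <stdint.h>

#define N_CONTEXTS 256                 /* made-up number of buckets */

static size_t bytes_per_context[N_CONTEXTS];

void *context_malloc(size_t n)
{
    /* Identify the allocation site from the caller's return address.
       Hashing in __builtin_return_address(1), (2), ... would add more
       context, as suggested above, where the platform permits it.    */
    uintptr_t h = (uintptr_t)__builtin_return_address(0);
    unsigned ctx = (unsigned)((h >> 4) % N_CONTEXTS);

    bytes_per_context[ctx] += n;       /* here: just profile per context;  */
    return malloc(n);                  /* a real version would instead pop */
                                       /* from a per-context pool          */
}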

I have sometimes wondered about an instrumented GC which records
allocation and reference patterns in order to find out which
allocations are used to create objects which link to which other
objects, so that they can be grouped together. This data could be
collected for a number of test runs and then used to generate
optimisation tables for the production version. I don't think that
this could use the return addresses to identify things (since that
will change between runs), but the generated code could maintain
something similar.

Does this make sense?

> It's not as good as real type information, but it'll probably get most of
> the same benefit. (In my personal opinion, that is. Soon I may have data
> that say whether I'm right.)

Identifying the allocation context will probably give better value.
Consider a linked list class which uses chains of cons-cells. It is
best if each discrete list is co-located, rather than putting all the
cons-cells into a single huge pool. Context-based allocation would be
able to do this.


Paul.

--
Paul Johnson | GEC-Marconi Ltd is not responsible for my opinions. |
+44 1245 473331 ext 2245+-----------+-----------------------------------------+
Work: <paul.j...@gmrc.gecm.com> | You are lost in a twisty maze of little
Home: <Pa...@treetop.demon.co.uk> | standards, all different.

James Kanze US/ESC 60/3/141 #40763

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <40apft$3...@news.parc.xerox.com> bo...@parc.xerox.com (Hans
Boehm) writes:

[Concerning the quality, or lack thereof, of standard string
classes...]

|> I don't think you should use string manipulation where it's inappropriate.
|> But many programs (editors, mailers, news readers, etc.) naturally
|> manipulate long strings. Currently they mostly invent their own custom
|> representations, except where the authors decided not to bother. In those
|> places they effectively break on unexpectedly long input. Why not use
|> a robust representation as the standard, and then optimize in those
|> few places where it doesn't perform adequately, and it's safe to
|> optimize?

I've considered this a bit, too. I know of several ways of
implementing a string class that are significantly faster than the
classical solution for specific operations.

The problem is that the operations which need to be optimized are not
the same for different applications. A compiler will need different
string handling than an editor, for example.

With this in mind, I'm not really convinced that the solution is to
try and create an optimal string class as a standard. I rather think
of the standard string class as a facility for people like myself,
whose programs only do string handling secondarily (formatting error
messages, and the like). If I were writing an editor, for example, I
would not expect the standard string class to be acceptable for my
text buffers. In this regard, just about any string class with the
required functionality will do the trick. (And it is more important
that the string class be easier to use than that it be fast.)

This does mean that most text oriented applications will have to
`re-invent the wheel', in that they will have to write their own
string class. But I'm not convinced that there is a string class
which would be appropriate for all applications, anyway.

Stefan Monnier

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <40ceal$o...@miranda.gmrc.gecm.com>,
Paul Johnson <p...@gmrc.gecm.com> wrote:
] How about looking at the call stack? You can identify the allocate
] call by looking at your return address. Then localise objects
] according to that. You could extend this scheme by hashing the last
] few return addresses together to get some idea of the context.

ignoring the fact that looking at the stack is not portable, there is another
problem: how do you plan on making the thing fast enough?
(improving locality is good, but you don't want your allocator to take ten times
as much time, do you?)


Stefan

David Boles

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <807948...@wildcard.demon.co.uk>,

Cyber Surfer <cyber_...@wildcard.demon.co.uk> wrote:
>In article <jyw...@bmtech.demon.co.uk>
> sco...@bmtech.demon.co.uk "Scott Wheeler" writes:
>> addition to the draft, but since I don't know of a PC compiler with the
>> extensions to the template functionality required to support the STL,
>> this isn't a major loss.
>
>Yes, the lack of such a compiler is a major pain. It's made worse
>by a lack of _choice_ of compiler, but that's another matter. I can
>at least choose which books I read. ;-)

I'd like to point out that if you had a choice, such a compiler does
indeed exist. In fact it was the PC reference compiler for the STL
implementation: IBM's CSet++ 2.1 or 3.0. I use the public
implementation of STL with it and have had no problems so far. It is
an *extremely* high quality compiler that is very well supported.
If you have to use C++ on a PC, there is no better choice.

Cheers,

Dave Boles
dbo...@convex.com

Cyber Surfer

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <jyw...@bmtech.demon.co.uk>
sco...@bmtech.demon.co.uk "Scott Wheeler" writes:

> "The Draft Standard C++ Library", PJ Plauger, Prentice Hall,
> 0-13-117003-1

Excellent, thanks. I have some book tokens to use up, so that'll help!

Hans Boehm

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
wa...@cup.hp.com (Thomas Wang) writes:

>Precisely. No relocation/compaction is likely to have the best performance
>for this reason. Pathological fragmentation is still a problem, which need
>to be addressed. Interestingly enough, present day VM hardware is already
>sufficient to minimize the fragmentation problem through NOT MAPPING the
>free fragments. So the true problem lies in OS software, which provides no
>API to portably access this feature. What needs to be done is for the
>researchers to present a standard package of API to various OS standard bodies
>to make sure wide adoption of various VM access functions.

>In one sample implementation, I imagine a non-moving incremental garbage
>collector with a software write barrier. This collector uses a binary buddy
>system memory allocator, which minimizes external fragmentation. Internal
>fragmentation is minimized through VM hardware of not mapping the free
>fragments. Sub-page fragmentation is minimized by grouping small objects
>into pools. This collector would have good interactive performance, and
>should be cache friendly as well.

I'm all in favor of portable APIs to access the VM system. But it's
important to get the details right. I think you want a call that marks
certain pages "disposable", i.e. the operating system may replace them
by zero-filled pages AT ITS DISCRETION. An mmap/munmap style interface
that gives the client explicit control over what's mapped doesn't seem
quite right. In particular, remapping a previously unmapped page is
unavoidably expensive, even if there was no demand for the physical page.
Security issues require that the page be cleared or somehow overwritten
every time you do that. (This is another reason I'm not convinced that
malloc implementations should use munmap to return space to the OS.)
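
On systems that have it, madvise() is close to the right shape. Here is
a sketch of how an allocator might mark a free run of pages disposable;
release_pages is a made-up name, and the exact semantics of
MADV_DONTNEED vary (on some systems the contents really are discarded
and read back as zeros, on others it is only a hint):

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

void release_pages(void *start, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t lo = ((uintptr_t)start + page - 1) & ~(uintptr_t)(page - 1);
    uintptr_t hi = ((uintptr_t)start + len) & ~(uintptr_t)(page - 1);
    if (hi > lo)
        madvise((void *)lo, hi - lo, MADV_DONTNEED);
    /* The mapping stays valid, so reusing the range later costs no
       munmap/mmap pair and no mandatory page clearing.              */
}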

The right primitive would generally allow an OS not to page out free space.
It is thus useful in many contexts. But I'm not sure that the desire to
use binary buddy allocation is particularly convincing. Since this allocator
presumably only works on page-sized or larger chunks, its running time
is likely to be in the noise compared to object initialization time. Thus
you might as well use something that makes more efficient use of the
address space (e.g. some clever best fit implementation).

The other issue that needs to be handled somehow is what happens when you
run out of swap space. I would want either AIX-style "low swap space"
warning signals, or an explicit call, which may fail, to reclaim a
"disposable" page. A lot of real software at least tries to handle running
out of memory gracefully. It succeeds enough of the time that you have to
continue to give it a chance.

I'd also love to see a cheap way to retrieve "page dirty" information
from the OS. A software write barrier isn't always practical (e.g.
when dealing with third-party libraries.) And for some algorithms
dirty bits suffice as a VM write-barrier. And they're
typically cheap to get from the hardware.

Henry Baker

unread,
Aug 10, 1995, 3:00:00 AM8/10/95
to
In article <40dfmv$l...@news.parc.xerox.com>, bo...@parc.xerox.com (Hans
Boehm) wrote:

> I'm all in favor of portable APIs to access the VM system. But it's
> important to get the details right. I think you want a call that marks
> certain pages "disposable", i.e. the operating system may replace them
> by zero-filled pages AT ITS DISCRETION. An mmap/munmap style interface
> that gives the client explicit control over what's mapped doesn't seem
> quite right. In particular, remapping a previously unmapped page is
> unavoidably expensive, even if there was no demand for the physical page.
> Security issues require that the page be cleared or somehow overwritten
> every time you do that. (This is another reason I'm not convinced that
> malloc implementations should use munmap to return space to the OS.)

As many VM's and/or disks are already starting to compress pages when
swapping them out, the cost of zero'ing is no larger than the cost of
compressing, and compressing a page of zeros is moderately easy. So
your concerns about the cost of zeroing out pages are probably misplaced.

If forced to live with an interface, I'd be happy with the following
simple interface:

The OS notices (either by compression or other means) that a page is
all 0's. In this case, it doesn't bother writing the page to disk,
and frees the disk page as well. (If you're still concerned about
security, then you're going to have to rewrite the page, or start
leaning on your disk vendors to provide destructive reads a la
linear logic.) Later, if the page is referred to again, the VM
can reconstitute it, again without any disk I/O. The same trick
will also work for caches.
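
The test itself is cheap compared with either compression or disk I/O;
roughly this (my sketch, not any particular kernel's code):

#include <stdint.h>
#include <stddef.h>

int page_is_all_zero(const void *page, size_t page_size)
{
    const uintptr_t *p = (const uintptr_t *)page;
    size_t i, n = page_size / sizeof(uintptr_t);
    for (i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;   /* all zero: skip the write and free the disk slot */
}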

The OS should also provide a call to explicitly zero a region of memory.
The first and last pages of the region can be zeroed explicitly, with
the remainder of the pages being zero implicitly by declaring them to
be 0's in the page table. This is an optimization, since the user can
achieve the same effect more clumsily by zeroing the pages himself.

Finally, the OS could provide some method for 'write-allocate' in VM. One
scheme might be to keep track of a compact region of a page that has been
written to. Any reads will force a read from the disk, and a merging of
the new & old information. Another method might be to do some kind of
checkpoint of the program at the point that it starts writing to the
page, and if it ever reads before the page is fully initialized, then
the program is restarted with the 'real' page. Yet another scheme would
be to explicitly store the write-allocate bits from the cache in real
memory when the cache flushes. If the page has write permission, but
not read permission, then a trap will occur if the program attempts
to access the page.

> I'd also love to see a cheap way to retrieve "page dirty" information
> from the OS. A software write barrier isn't always practical (e.g.
> when dealing with third-party libraries.) And for some algorithms
> dirty bits suffice as a VM write-barrier. And they're
> typically cheap to get from the hardware.

I've been trying since 1974 to get "page dirty" information from the VM with
relatively little success.

Robert P. Krajewski

unread,
Aug 11, 1995, 3:00:00 AM8/11/95
to
In article <KANZE.95A...@slsvhdt.lts.sel.alcatel.de>,
ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

>The problem is, that the operations which need to be optimized are not
>the same for different applications. A compiler will need different
>string handling than an editor, for example.
>
>With this in mind, I'm not really convinced that the solution is to
>try and create an optimal string class as a standard. I rather think
>of the standard string class as a facility for people like myself,
>whose programs only do string handling secondarily (formatting error
>messages, and the like). If I were writing an editor, for example, I
>would not expect the standard string class to be acceptable for my
>text buffers. In this regard, just about any string class with the
>required functionality will do the trick. (And it is more important
>that the string class be easier to use than that it be fast.)
>
>This does mean that most text oriented applications will have to
>`re-invent the wheel', in that they will have to write their own
>string class. But I'm not convinced that there is a string class
>which would be appropriate for all applications, anyway.

OK, but what might be appropriate is that there's a standard string class
*interface*, so that even implementors of specialized string classes can
use standardized routines for the things they need, but don't need to be
optimized. Such a standard also allows them to pass such objects to
external libraries that expect a standard interface for strings.
Otherwise, there's going to be a lot of reimplementation going on. (As if
there isn't already.)

I designed a string class that was actually two string classes. The first
class was SimpleString. All it assumed was the existence of a pointer to
characters, C style (this is C++), and a length count. Even with this
constraint, which rules out the possibility of growing a string or
allocating storage later in its lifetime, the SimpleString class could do
a lot of useful work -- counting spacing characters, searching, copying,
conversion to various character sets, and even in-place modifications that
don't require reallocation in any case, such as case (lower/upper)
conversion.

By breaking the assumption of a certain kind of allocation, it was
possible to derive from this class to take care of differing situations.
First, there was the general string class that allocated character storage
from the heap and pretty much rounded out the set of the operations that
were destructive or involved side-effects. But you could also derive
classes that always expected that the storage would be managed for them,
especially in the very common case where a string's length would never
increase once it was created -- the character storage might come from the
stack, or might have been cleverly arranged to come from the same heap
block as the heap allocation of the string object itself (i.e., allocate a
block the size of the string descriptor, plus the storage needed for
characters). So even "clever" strings could be passed to the simpler, more
general operators that didn't care about cleverness.
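
Stripped to its bones, the split looks like this (SimpleString and
HeapString here are my sketch, not the actual classes, and the
operations shown are just examples of the "no reallocation needed"
category):

#include <cstring>
#include <cstddef>

class SimpleString {            // assumes only "pointer + length"
public:
    SimpleString(char *p, std::size_t n) : chars(p), len(n) {}
    std::size_t length() const { return len; }
    std::size_t count(char c) const {           // read-only work
        std::size_t k = 0;
        for (std::size_t i = 0; i < len; i++)
            if (chars[i] == c) k++;
        return k;
    }
    void upcase() {                             // in-place, never reallocates
        for (std::size_t i = 0; i < len; i++)
            if (chars[i] >= 'a' && chars[i] <= 'z')
                chars[i] -= 'a' - 'A';
    }
protected:
    char *chars;
    std::size_t len;
};

class HeapString : public SimpleString {        // adds ownership of the storage
public:                                         // (copying left out of the sketch)
    HeapString(const char *s)
        : SimpleString(new char[std::strlen(s)], std::strlen(s))
        { std::memcpy(chars, s, len); }
    ~HeapString() { delete [] chars; }
};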

Sam Kendall

unread,
Aug 11, 1995, 3:00:00 AM8/11/95
to
ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

>Accessing contents with an index > 0 is undefined behavior;
>with any implementation of reasonable quality, it will cause a bounds
>check error. (Of course, I've never seen an implementation of C with
>reasonable quality, so you may be safe. Although I seem to have heard
>somewhere that there is one, Centerline, maybe?)

That's right; CenterLine has a C/C++ interpreter that checks all sorts of
things, including array and pointer bounds. There are other products,
notably Purify and similar things, that catch some but not all bounds
violations in these awful, unsafe languages.

My fantasy: what if 10% of the effort spent in developing these C/C++
error-checking tools (I've written major chunks of several of them) were spent
switching programmers over to a good, safe language like Modula-3 or Dylan?
Ah, but on these newsgroups I'm mostly preaching to the converted.

--Sam Kendall


Cyber Surfer

unread,
Aug 11, 1995, 3:00:00 AM8/11/95
to
In article <40djgo$f...@zeppelin.convex.com>
dbo...@news.eng.convex.com "David Boles" writes:

> I'd like to point out that if you had a choice, such a compiler does
> indeed exist. In fact it was the PC reference compiler for the STL
> implementation: IBM's CSet++ 2.1 or 3.0. I use the public
> implementation of STL with it and have had no problems so far. It is
> an *extremely* high quality compiler that is very well supported.
> If you have to use C++ on a PC, there is no better choice.

I'll see what I can do, but at the moment I don't have any choice
over which C/C++ compiler I use. I'm currently using VC++ 1.5 and/or
SC++ 6.1, depending on which machine I use, but an upgrade may be
possible later this year.

Does CSet++ support the Win16 and Win32 platforms?

Thanks.

Erik Naggum

unread,
Aug 11, 1995, 3:00:00 AM8/11/95
to
[Stefan Monnier]

| ignoring the fact that looking at the stack is not portable,

memory allocation is never portable. I just realized that this may be why
C++ is so bad in this area, and why C++ programmers write their own
allocators all the time. non-portable things should be isolated and have
geniuses optimize the hell out of them.

| there is another problem: how to you plan on making the thing fast
| enough ? (improving locality is good, but you don't want your
| allocator to take ten times as much time, do you?)

someone said that one page fault costs on the scale of 10,000,000
instructions on today's CPU, memory, and disk speeds. if you avoid a page
fault, you have time to spare. later, you can optimize the savings...

here's one statistic: I have helped a math grad student write a C program
that does some horrible analysis on millions of points in fluid mechanics.
she insisted upon an access pattern that trashed the machine she ran it on,
so I wrote a custom matrix allocator that matched her access pattern, and
runtime fell by 70%. bonus: this allocator is also *much* faster than
your regular 5000 malloc calls to set up the indirection vector into the
matrix, since it allocates all of the memory in one fell swoop. it could
of course be greatly improved with a language that let me specify how to
access the memory, but worrying about the allocator does pay off at times.
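
the trick itself is simple; a sketch of the single-block version
(alloc_matrix is a made-up name, and you would lay the data out
column-major instead if that is what the access pattern wants):

#include <stdlib.h>

double **alloc_matrix(size_t rows, size_t cols)
{
    /* one malloc: the row-pointer vector, padded for alignment,
       followed by all rows*cols elements                         */
    size_t hdr = (rows * sizeof(double *) + sizeof(double) - 1)
                 / sizeof(double) * sizeof(double);
    char *block = (char *)malloc(hdr + rows * cols * sizeof(double));
    if (!block) return 0;
    double **m = (double **)block;
    double *data = (double *)(block + hdr);
    for (size_t r = 0; r < rows; r++)
        m[r] = data + r * cols;    /* rows laid out contiguously */
    return m;                      /* a single free(m) releases everything */
}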

#<Erik 3017137801>
--
#<Xyzzy 202B370B73>

mfi...@inmind.com

unread,
Aug 12, 1995, 3:00:00 AM8/12/95
to
In <808121...@wildcard.demon.co.uk>, Cyber Surfer <cyber_...@wildcard.demon.co.uk> writes:
>
>Does CSet++ support the Win16 and Win32 platforms?
>

Currently, C/Set++ (renamed to VisualAge C++ with version 3.0) supports OS/2,
AIX and a couple of other platforms. However, the NT port is currently in
progress (I have the first NT beta) and will be released sometime around the
first of next year. You can probably get on the beta, the OS/2 beta had two
public releases (the current NT beta is private) that you could pick up for $25.
Even the betas were far better compilers than most released commercial
compilers -- especially the Microsoft compilers which are (in my opinion) basically
junk.

However, the C/Set++ compilers are ALL 32-bit and will not be supporting 16-bit
code. They will, however, LINK to 16-bit code via "thunks". The NT compiler
will run on and support both NT and Win95.

Michael Lee Finney -- no relation to IBM other than testing of betas.


Paul Wilson

unread,
Aug 12, 1995, 3:00:00 AM8/12/95
to
In article <kelvin.807660304@kickapoo>,

Kelvin Nilsen <kel...@cs.iastate.edu> wrote:
>p...@cs.umass.edu ( Robin Popplestone ) writes:
>
>>> In article <3vnghm$s...@info.epfl.ch>,
>>> Stefan Monnier <stefan....@epfl.ch> wrote:
>>>In article <hbaker-3107...@192.0.2.1>,
>>>Henry Baker <hba...@netcom.com> wrote:
>>>] IMHO the major gain from compaction is to defragment the address space,
>>>] but as Paul's recent work has shown, one may not have to do this very
>>>] often.
>
>>And Paul Wilson writes in <3vnvr1$9...@jive.cs.utexas.edu>
>
>>> Right. For "most" programs (that have been studied in enough detail, that
>>> is), it looks like you never have to do it at all. Nobody's studied
>>> really long-running programs, though, and it's not clear whether
>>> fragmentation will accumulate over time, or stabilize at an acceptable
>>> level.
>
>While many individual programs tend to exhibit the behavior that Paul
>describes, I would certainly not want to gamble the future of automatic
>garbage collection on overgeneralization of this empirical observation.

I wouldn't either. I'm not arguing that compaction is not a good thing;
in the best of all possible worlds most people would use languages and
compilers and hardware that are friendly to copying garbage collection.

I *will* argue, though, that the importance of compaction appears to
be widely overestimated, and that there's some data that suggests that
most programs are ``well behaved'' with respect to fragmentation, if
you pick a good allocator. Unfortunately, this appears to be true for
most of the data I've seen (and gathered) but there doesn't seem to be
nearly enough data to know how true it is.

I also think that some of the exceptional cases (where fragmentation is
problematic) can be dealt with reasonably well by hacking implementations
of data structures that respect granularity issues.

>There are a number of reasons why it would be wise to hedge your bets:
>
>One reason that defragmentation is often not necessary is because
>many of the programs that have been studied empirically allocate
>a large proportion of their objects from a small number of
>predetermined size classes. This is not necessarily representative
>of the "typical" application of the future:
>
> a) The tendency to allocate objects in predetermined sizes is
> in part a consequence of certain programming language
> implementation designs (e.g. Lisp uses dotted pairs to
> represent "everything") and partly the influence of static
> programming language tunnel vision. One of the indirect benefits
> of garbage collection is that it frees programmers from these
> restrictive mindsets. Cdr-coded Lisp lists are likely to
> vary greatly in length. Does anyone have statistics?

My impression is that CDR-coding is dead, and I'll bet it stays dead.
It's just too slow on stock hardware, and it's not clear it's really worth
it on standard hardware.

Compressed paging is more general, and easier to support with reasonable
efficiency. It also has the potential to hide a multitude of data
representation sins, which is good---you can use fast "horizontal"
representations, and compress them down to something more nearly optimal
in the VM (or even cache) system(s). (See the paper "Operating System
Support for Small Objects" in our ftp repository mentioned in my .sig.)

> In Icon, efficiency of string operations was a major design
> objective. Since the implementation of strings is so efficient,
> Icon programmers feel very comfortable creating strings ranging
> in length from several bytes to tens of kilobytes. Some
> real-world programs even create strings that exceed a hundred
> kilobytes. Icon's implementations of lists and dynamic hash tables
> also benefit greatly from the ability (made possible through
> automatic compacting garbage collection) to size internal objects
> in whatever sizes are most appropriate for run-time efficiency.

I think almost all real scalable data structures allow for some variation
in the granule size, so that library implementors can pick sizes that
don't tend to cause a lot of obnoxious fragmentation. This is not as nice
as having fragmentation automatically "go away" due to compaction, but it'll
do in a pinch, and I think non-compacting memory management will generally
have a slight efficiency edge---or a large one, for large objects.

> b) Consider the implementation of digital TV studio software. The
> "programmer" is likely to need quick access to a large number of
> digital audio and video clips. The lengths of these clips will
> be highly unpredictable. I can envision the day when these
> audio-visual clips would be as easily manipulated as Icon strings
> are manipulated today.

I'm not sure I follow this. I suspect that scaleable data structures are
a good idea in both cases, and that in both cases the fragmentation problem
can be minimized by the appropriate granularity choices in implementing
the data structures. The fact that strings and compressed audio or video
don't generally require contiguous storage means that fixed chunk sizes
can be used without much cost in either internal or external fragmentation.

> c) Consider the implementation of a PDA system. Suppose the system
> has no disk. The PDA is required to retain its user's appointment
> calendar (with variable amounts of information representing each
> day's scheduled and logged activities), contact lists (with variable
> amounts of information associated with each contact), scratch pads
> (including spreadsheets, ASCII notes, untranslated pen scribbles,
> vector graphics, email memos sent and received, faxes sent
> and received), voice notes (annotations to the scratch-pad
> information, personal dictations, voice mail forwarded from
> the central office), executable code segments (for each of the
> supported applications), etc. Clearly the sizes of allocated
> segments are highly unpredictable. Now, add another complicating
> factor: compress all objects that are larger than a particular
> threshold size and have not been accessed within the last N
> minutes. The compression factor itself is unpredictable.

Right, but compressed data don't have to be stored contiguously. They
can be divided into fixed-sized chunks that are trivial to manage without
significant fragmentation, and strung together during uncompression to
"materialize" a contiguous object (or page) on demand. (This is not
hypothetical, by the way---it's from experience with real compressed paging.)

>In summary, the notion that "defragmentation of dynamic memory is not
>necessary" is a self-fulfilling prophecy. In the absence of compaction
>and/or in environments that provide seemingly unbounded amounts of
>virtual memory, programmers are likely to develop software in ways that
>do not depend on defragmentation, but there are real, non-trivial costs
>associated with this style of development.

True, but I don't think the costs are huge. And in the long run, I'm on
your side---when we figure out what the *real* issues are in storage
management, we should evolve towards more flexible systems that can support
more nearly ideal memory management.

On the other hand, it's not clear what fragmentation *is* costing. Given
the pathetic state of allocator research at this point, I wouldn't be
surprised if a coupla billion dollars worth of RAM is going to waste
right now due to dumb allocators. (Given the widespread use of ad hoc
and brain-dead allocators in C++, I'd be surprised if a coupla billion
dollars worth of RAM *isn't* going to waste right now.) My agenda at
the moment is to fix that problem. I suspect that if billions of dollars
are being wasted at the moment, there are some easy fixes that will fix
*most* of the problem. (See our allocator survey, also in our ftp
repository.) Then we get into the fancy stuff.

--
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wil...@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and Scheme interpreters and compilers available via ftp from
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)

Cyber Surfer

unread,
Aug 13, 1995, 3:00:00 AM8/13/95
to
In article <40it8r$j...@mujibur.inmind.com> mfi...@inmind.com writes:

> Currently, C/Set++ (renamed to VisualAge C++ with version 3.0) supports OS/2,
> AIX and a couple of other platforms. However, the NT port is currently in
> progresss (I have the first NT beta) and will be released sometime around the
> first of next year. You can probably get on the beta, the OS/2 beta had two
> public releases (the current NT beta is private) that you could pick up for
> $25.

Well, I'll probably be using Win95 long before I use NT, but if IBM
have yet to add Win95 support, then I can wait for it. It's _very_
unlikely that I'll be using OS/2 in the near future, but it's worth
knowing about.

> Even the beta were far better compilers than most released commercial
> compilers -- especially the Microsoft compilers which are (in my opinion)
> basically
> junk.

I agree. I try to avoid VC++, which is why I'm curious about CSet++.
Almost all my current Win16 development is done using VC++, but I'm
hoping that'll change soon.



> However, the C/Set++ compilers are ALL 32-bit and will not be supporting 16-bit
> code. They will, however, LINK to 16-bit code via "thunks". The NT compiler
> will run on and support both NT and Win95.

Ah. Well, I need Win16 _and_ Win32 support at the moment. I'm not sure
if I can use Win32s, altho I'd love to just forget about the Win16 API
and only code for Win32, whilst using Win32s for running my code on Win16.

Sadly, this is not my choice to make. If it was, then I might choose
something like Allegro CL, or Dylan. Perhaps that's _why_ it's not my
choice...



> Michael Lee Finney -- no relation to IBM other than testing of betas.

You can never beta test enough. ;-)

Many thanks.

Peter Holzer

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
"Stefan Monnier" <stefan....@epfl.ch> writes:

>In article <40ceal$o...@miranda.gmrc.gecm.com>,
>Paul Johnson <p...@gmrc.gecm.com> wrote:
>] How about looking at the call stack? You can identify the allocate
>] call by looking at your return address. Then localise objects
>] according to that. You could extend this scheme by hashing the last
>] few return addresses together to get some idea of the context.

>ignoring the fact that looking at the stack is not portable, there is another
>problem: how to you plan on making the thing fast enough ?
>(improving locality is good, but you don't want your allocator to take ten times
>as much time, do you?)

Doesn't have to take much time. It is probably sufficient to use the
top 3 or 4 return addresses (1 almost certainly isn't enough. Too many
programs use a wrapper around malloc), each of which means 2 memory
accesses (close together, probably in the same cache line) and an
addition. Plus the overhead of the hash function of course. If it saves
1 page fault per 10000 mallocs it has repaid its cost several times.

hp

--
_ | Peter Holzer | Systemadministrator WSR | h...@wsr.ac.at
|_|_) |-------------------------------------------------------
| | | ...and it's finished! It only has to be written.
__/ | -- Karl Lehenbauer

Stefan Monnier

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
In article <40n8j5$6...@wsrcom.wsr.ac.at>,
Peter Holzer <h...@wsrdb.wsr.ac.at> wrote:
] Doesn't have to take much time. It is probably sufficient to use the
] top 3 or 4 return addresses (1 almost certainly isn't enough. Too many
] programs use a wrapper around malloc), each of which means 2 memory
] accesses (close together, probably in the same cache line) and an
] addition. Plus the overhead of the hash function of course. If it saves
] 1 page fault per 10000 mallocs it has repaid its cost several times.

First thing, unless your program really eats up much memory, this is likely
not to save any page fault. (Of course, you might disagree, based on what you
consider a typical program, but the point still holds: many programs' working
sets fit in memory. Saving memory is never bad, but in the usual unix view of
"one user, one workstation, one active program", the user won't notice
the improvement.)

Anyway, the speed of malloc *is* important because of the impact it has on
programmers. If malloc is considered as slow (as is often the case), people will
continue to write custom routines and to avoid dynamic allocation (of course,
they also avoid it because of the burden implied by the lack of GC).
A fast malloc consists of very few cycles (basically: find the free list
corresponding to the required size and pop an element). Your 3 or 4 lookups (the
first will be trivial but the others might be more involved since they depend on
the frame size) plus hash-table access will probably mean that malloc will be at
least 3 times slower (and I consider 3 as very optimistic). It's definitely not
negligible, and it definitely had better improve locality noticeably to be
worth the bother!
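
For concreteness, that fast path looks roughly like this (fast_malloc
and the size classes are made up; refilling an empty list and the
matching free are left out, with plain malloc standing in for the slow
path):

#include <stdlib.h>

#define N_CLASSES 32
#define GRANULE   8                   /* size classes 8, 16, 24, ... bytes */

struct FreeBlock { struct FreeBlock *next; };
static struct FreeBlock *free_lists[N_CLASSES];

void *fast_malloc(size_t n)
{
    size_t c = (n + GRANULE - 1) / GRANULE;     /* pick the size class */
    if (c >= N_CLASSES || free_lists[c] == 0)
        return malloc(n);                       /* slow path (refill omitted) */
    struct FreeBlock *b = free_lists[c];
    free_lists[c] = b->next;                    /* pop: a couple of memory */
    return b;                                   /* accesses and we're done */
}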


Stefan

Scott Wheeler

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
In Article <40djgo$f...@zeppelin.convex.com> David Boles writes:
>...

>I'd like to point out that if you had a choice, such a compiler does
>indeed exist. In fact it was the PC reference compiler for the STL
>implementation: IBM's CSet++ 2.1 or 3.0....

Sorry, I should have specified "no PC compiler for DOS/Windows" - which
appeared to be CS's target from the compilers he mentioned.

Scott

(Lisp newsgroups removed)

Hans Boehm

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
"Stefan Monnier" <stefan....@epfl.ch> writes:

>In article <40ceal$o...@miranda.gmrc.gecm.com>,
>Paul Johnson <p...@gmrc.gecm.com> wrote:

>] How about looking at the call stack? ...

>... how to you plan on making the thing fast enough ? ...

This is all discussed in great detail in "Using Lifetime Predictors to Improve
Memory Allocation Performance", Barrett and Zorn, PLDI 93
(SIGPLAN Notices 28, 6, pp. 187-196). There are even some measurements
there.

Hans Boehm

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
ka...@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) writes:

...

>With this in mind, I'm not really convinced that the solution is to
>try and create an optimal string class as a standard. I rather think
>of the standard string class as a facility for people like myself,
>whose programs only do string handling secondarily (formatting error
>messages, and the like). If I were writing an editor, for example, I
>would not expect the standard string class to be acceptable for my
>text buffers. In this regard, just about any string class with the
>required functionality will do the trick. (And it is more important
>that the string class be easier to use than that it be fast.)

>This does mean that most text oriented applications will have to
>`re-invent the wheel', in that they will have to write their own
>string class. But I'm not convinced that there is a string class
>which would be appropriate for all applications, anyway.

I agree that some applications will need their own string class.
Nonetheless, I think a standard string class is important. Many libraries
will require strings as input parameters, or generate strings. If you
want any hope of having such libraries play together, they will need
a standard string representation.

I claim the characteristics you want from a standard string class are:

1) It should be easy to use, with a clean interface. It should be
general enough to allow reusability of libraries that rely on it.

2) It should be robust. The implementation should scale reasonably
to large inputs, both in that it remains correct, and that its performance
remains reasonable.

I would argue that a string class based on a conventional flat representation,
and whose interface exposes that, doesn't do very well on either criterion.
It's often difficult or impossible to pass some other string representation
(e.g. a file) to a string function without actually changing representation
(e.g. reading the whole file). A tree-based representation can, at minimal
cost, accommodate strings with user-specified character access functions.

A flat representation does allow in-place string updates,
a facility I would personally rather leave out of a "standard" string class.
You still have arrays of characters when you need them.

The standard string representations don't scale to long
strings. Code that took 10 milliseconds on 100-character strings may
take more than a minute on 10000-character strings. Programs are unlikely
to be usable for string inputs much longer than what the author
directly anticipated with a custom string class, even if machines
have gotten appreciably faster in the interim. This sort of behaviour
is exhibited by any program that builds a long string by repeated
concatenation, or repeatedly inserts characters in the middle of a long
string.

(Another, more subtle, problem with flat string representations is
that they don't interact very well with storage allocators,
especially again as strings get long. Fragmentation is much less likely
with many short string chunks than with long contiguous strings.
Even copying garbage collectors may suffer, since they will (hopefully)
avoid moving large pointerfree objects.)

Ed Osinski

unread,
Aug 14, 1995, 3:00:00 AM8/14/95
to
In article <40o1dr$3...@news.parc.xerox.com>, bo...@parc.xerox.com (Hans Boehm) writes:
|> (Another, more subtle, problem with flat string representations is
|> that they don't interact very well with storage allocators,
|> especially again as strings get long. Fragmentation is much less likely
|> with many short string chunks than with long contiguous strings.
|> Even copying garbage collectors may suffer, since they will (hopefully)
|> avoid moving large pointerfree objects.)

Aha! So this can then be used as a 'reason' why garbage collection is
not a good idea. :-)

---------------------------------------------------------------------
Ed Osinski
Computer Science Department, New York University
E-mail: osi...@cs.nyu.edu
---------------------------------------------------------------------
"No, no, no, don't tug on that. You never know what it might be attached to."
-- Buckaroo Bonzai to assistant during brain surgery

Stefan Monnier

unread,
Aug 15, 1995, 3:00:00 AM8/15/95
to
In article <40o99k$r...@cmcl2.NYU.EDU>, Ed Osinski <osi...@cs.nyu.edu> wrote:

] In article <40o1dr$3...@news.parc.xerox.com>, bo...@parc.xerox.com (Hans Boehm) writes:
] |> (Another, more subtle, problem with flat string representations is
] |> that they don't interact very well with storage allocators,
] |> especially again as strings get long. Fragmentation is much less likely
] |> with many short string chunks than with long contiguous strings.
] |> Even copying garbage collectors may suffer, since they will (hopefully)
] |> avoid moving large pointerfree objects.)
] Aha! So this can then be used as a 'reason' why garbage collection is
] not a good idea. :-)

I see the smiley, but still: he didn't say that flat strings interact poorly
with GC, but that they interact poorly with storage allocators and that even
copying GCs (which could avoid that bad interaction thanks to their compacting
behavior) don't help since they would often treat such big pointer-free objects
in a special way (they wouldn't copy them).

So what he says is that dynamically allocated flat strings are bad because of
their potential for generating fragmentation, even in the most favorable case
(a copying GC).

This doesn't look like a reason why GC is not a good idea !


Stefan

Erik Corry

unread,
Aug 15, 1995, 3:00:00 AM8/15/95
to
Hans Boehm (bo...@parc.xerox.com) wrote:

: I claim the characteristics you want from a standard string class are:

: 1) It should be easy to use, with a clean interface. It should be
: general enough to allow reusability of libraries that rely on it.

: 2) It should be robust. The implementation should scale reasonably
: to large inputs, both in that it remains correct, and that its performance
: remains reasonable.

3) You should be able to get a C-style null-terminated char array
out of it at minimal cost, because you are going to need this to deal
with just about every library written in C up until your great new
implementation came along.

This seems to me to be a major argument for having a flat representation
internally, too. If you don't have simple extraction of/conversion to
a C-style null-terminated character array, then your new string library
is not going to be used.

--
New Age: Science meets a modern Hydra
--
Erik Corry ehc...@inet.uni-c.dk

Joseph H Allen

unread,
Aug 16, 1995, 3:00:00 AM8/16/95
to
In article <DDBsK...@kroete2.freinet.de>,

You may want to look at the automatic string library in my editor joe (get
by anonymous ftp from ftp.worcester.com). This library does nothing
interesting with memory allocation, but it satisfies '3' above nicely.
Unfortunately you probably need indirection for any interesting GC. Anyway
here's the man page for it. There may be some ideas you can get for making
a more compatible string class. There's also an automatic array of strings
library for things like environment variables and file name lists.

Name
vsncpy, vsrm, sLEN, vstrunc, vsensure, vsins, vsdel, vsfill, vsset,
vsadd - Automatic string management functions

Syntax
#include <vs.h>

char *vsncpy(char *d,int off,char *s,int len);
void vsrm(char *d);
int sLEN(char *d);
char *vstrunc(char *d,int len);
char *vsensure(char *d,int len);
char *vsins(char *d,int off,int len);
char *vsdel(char *d,int off,int len);
char *vsfill(char *d,int off,int c,int len);
char *vsadd(char *d,int c);
char *vsset(char *d,int off,int c);

Description
This is a string library which supports strings which automatically
resize themselves when needed. The strings know their own size, so getting
the length of a string is always a fast operation and storing NULs in the
string is permissible. The strings are backward compatible with C's regular
zero terminated strings.

Each automatic string is stored in its own malloc block and has the
following format:

<bksize><length><string><zero>

'bksize' and 'length' are integers which give the size of the malloc
block and the length of the string. A zero character always follows the
string for compatibility with normal C zero-terminated strings. The zero is
not counted as part of the string length.

The strings are not addressed with 'bksize' (the beginning of the
malloc block). Instead, they are addressed at the first actual character of
the string itself. This means that an automatic string looks like a normal
C string and can be addressed with type 'char *'. Also the array access
operator '[]' works for reading and overwriting automatic strings and
automatic strings can be passed directly to UNIX operating system functions.
However, free() can not be used to dispose of automatic strings. Instead,
vsrm() must be used. Also an automatic string plus an offset is not an
automatic string, but is still a legal C language string.

Primary function
_vsncpy_ - Copy a block of characters at address 's' of length 'len'
onto the automatic string 'd' at offset 'off'. The automatic string is
expanded to handle any values of 'len' and 'off' which might be given. If
'off' is greater than the length of the string, SPACEs are placed in the
gap. If 'd' is NULL, a new string is created. If 'len' is 0, no copying or
string expansion occurs. _vsncpy_ returns the automatic string, which may
have been realloced or newly created in its operation.

_vsncpy_ is the most important automatic string function. It is
both the primary constructor of automatic strings and is also a useful
operator. It works in close conjunction with the following macros:

sc("Hello") Gives --> "Hello",sizeof("Hello")-1
sz(s) Gives --> s,zlen(s)
sv(d) Gives --> d,sLEN(d)

These macros are used to build arguments for _vsncpy_. Many
functions can be created with combinations of sc/sz/sv and vsncpy:

s=vsncpy(NULL,0,NULL,0); Create an empty automatic string

s=vsncpy(NULL,0,sc("Hello")); Create an automatic string
initialized with the string "Hello"

d=vsncpy(NULL,0,sv(s)); Duplicate an automatic string

d=vsncpy(NULL,0,sz(s)); Convert a C string into an automatic
string

d=vsncpy(sv(d),sv(s)); Append automatic string s onto d

d=vsncpy(sv(d),sc(".c")); Append a ".c" extension to d.

d=vsncpy(d,0,sc("Hello")); Copy "Hello" to the beginning of d.
The original length of d is
unchanged, unless it had to be
expanded to fit "Hello".

Other functions

_vsrm_ is used to free an automatic string. If NULL is passed to
it, nothing happens.

_sLEN_ returns the length of an automatic string. If the string is
NULL, sLEN returns 0.

_vstrunc_ sets the length of an automatic string. The string is
created if NULL is passed to it. The string will be padded with spaces if
its length is increased. Vstrunc may reallocate the string if (and only if)
it is expanded, so the return value must not be ignored.

_vsensure_ reallocs the malloc block of the given string so that the
string can be later expanded to the specified length without any calls to
realloc.

_vsins_ inserts a gap into a string. If the string is NULL it is
created. If the specified offset is past the end of the string, the string
is extended.

_vsdel_ deletes a section of the string. It does nothing if the
specified offset is past the end of the string.

_vsfill_ fills a portion of a string to the specified character.

_vsadd_ appends a single character to the end of the string. A new
string is created if the specified string was NULL. This function is very
useful for loops which create strings by appending some source of
characters.

_vsset_ sets a character at a specified offset. A new string is
created if the specified string was NULL. The string is filled with SPACEs
if the specified offset is past the end of the string.
--
/* jha...@world.std.com (192.74.137.5) */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
