More thoughts on DOD


Leopold Toetsch

Jan 6, 2003, 7:30:47 AM1/6/03
to P6I
The attached test program shows some additional effects of PMC size on
timing. A PMC is 32 bytes, an SPMC is 16 bytes, matching the current and
minimal PMC sizes for i386 (or a typical 32-bit system).
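The two sizes can be pictured with a sketch like this (field names are
invented for illustration, not Parrot's actual layout; int32_t stands in
for the 4-byte pointers and INTVALs of a 32-bit build):

```c
#include <stdint.h>

/* Illustrative layouts only -- not Parrot's real structs.
   int32_t models the 4-byte pointers/INTVALs of a 32-bit system. */
typedef struct SPMC {       /* "small" PMC: 16 bytes */
    int32_t vtable;         /* vtable pointer */
    int32_t flags;          /* GC + type flags */
    int32_t int_val;        /* integer scalar payload */
    int32_t num_or_ptr;     /* or a pointer payload */
} SPMC;

typedef struct PMC {        /* full PMC: 32 bytes */
    int32_t vtable;
    int32_t flags;
    int32_t int_val;
    int32_t num_hi, num_lo; /* 8-byte FLOATVAL, split into halves */
    int32_t data;           /* pointer to extra data */
    int32_t metadata;       /* properties */
    int32_t next_for_gc;    /* GC chain pointer */
} PMC;
```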

1) For linear access, half-sized PMCs give double the mark speed. This
agrees well with the stress.pasm numbers.

2) As we have only one free_list per pool, memory usage will get more
and more random when, e.g., parts of arrays are rewritten with new
values taken from the free_list. The worst-case behaviour is totally
random access to our pools' data.

For PMCs, worst-case random access takes about double the time of
linear access. SPMCs take almost 3 times as long, but are still faster
than PMCs. The advantage of the smaller PMCs is almost gone, though.

3) Conclusion
To keep reused PMCs more tightly together, these numbers seem to
indicate that we might need a free_list per pool->arena, which would
require a pool->arena pointer in the PMC.
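A per-arena free list along those lines might be sketched as follows
(struct and function names are invented; the real structures live in
Parrot's small-object pool code):

```c
#include <stddef.h>

typedef struct Obj { struct Obj *next_free; } Obj;

typedef struct Arena {
    struct Arena *next;
    Obj          *free_list;   /* per-arena, not per-pool */
    size_t        free_count;
} Arena;

/* Return a freed object to the arena it came from, so reuse
   stays within one contiguous memory block. */
static void arena_free(Arena *a, Obj *o) {
    o->next_free = a->free_list;
    a->free_list = o;
    a->free_count++;
}

/* Allocate from the arena with the most free slots, so consecutive
   allocations tend to land close together in memory. */
static Obj *pool_alloc(Arena *arenas) {
    Arena *best = NULL;
    for (Arena *a = arenas; a; a = a->next)
        if (a->free_list && (!best || a->free_count > best->free_count))
            best = a;
    if (!best)
        return NULL;           /* caller grows the pool */
    Obj *o = best->free_list;
    best->free_list = o->next_free;
    best->free_count--;
    return o;
}
```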

4) With a pool->arena pointer in the PMC, we could also try a separate
flags field, sized at 1 byte per PMC.

5) On my system, pools of different sizes end up in totally different
ranges of memory (sbrk vs. mmap). This makes the quick bit-mask test
in trace_mem_block totally unusable.
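The quick test in question is essentially a single mask-and-compare
range check, along these lines (a simplified sketch, not the actual
trace_mem_block code):

```c
#include <stdint.h>

/* For one aligned block of size 2^BLOCK_BITS, any pointer into it
   shares the block's top address bits: one AND plus one compare. */
#define BLOCK_BITS 20                                   /* 1 MB block */
#define BLOCK_MASK (~(uintptr_t)((1u << BLOCK_BITS) - 1))

static int in_block(uintptr_t base, uintptr_t p) {
    return (p & BLOCK_MASK) == (base & BLOCK_MASK);
}
```

With sbrk-allocated pools near the heap base and mmap-allocated pools
high in the address space, no single (base, mask) pair covers both, so
the one-compare filter has to give way to a list of ranges.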

Comments, further experiments with 4) and so on welcome,

leo

tmpc.c.txt

James Michael Dupont

Jan 6, 2003, 7:53:01 AM1/6/03
to Leopold Toetsch, P6I
First of all: to run this, it needs a -I.

Second of all, the test has errors. The message
"Perl6grammar.pm did not return a true value at (eval 58) line 3."
can be ignored, because via some magic it is regenerated.

I get the following error
>>WHOA! Somehow you got a different number of results than tests ran!
>>This should never happen! Please contact the author immediately!
>>END failed--call queue aborted.

test results :
-----------------------------------------------------------------------
mdupont@PI
/cygdrive/c/development/testing/parrot/parrot/languages/perl6
$ perl -I. ./perl6 --test
Perl6grammar.pm did not return a true value at (eval 58) line 3.

Test details:
t/builtins/array.t........2/3
t/builtins/string.t.......4/4
t/compiler/1.t..........14/14
t/compiler/2.t............5/5
t/compiler/3.t............7/7
t/compiler/4.t............3/3
t/compiler/5.t............5/5
t/compiler/6.t............6/6
t/compiler/7.t............1/1
t/compiler/8.t............6/6
t/compiler/9.t............4/4
t/compiler/a.t............3/3
t/compiler/b.t............6/6
t/compiler/c.t............6/6
t/compiler/qsort.t........1/1
t/rx/basic.t..............6/6
t/rx/call.t...............2/2
t/rx/special.t............2/2

Test summary:
t/builtins/array.t......ok, 2/3 skipped: various reasons
t/builtins/string.t.....ok
t/compiler/1.t..........ok
t/compiler/2.t..........ok
t/compiler/3.t..........ok
t/compiler/4.t..........ok
t/compiler/5.t..........ok
t/compiler/6.t..........ok
t/compiler/7.t..........ok
t/compiler/8.t..........ok
t/compiler/9.t..........ok
t/compiler/a.t..........ok
t/compiler/b.t..........ok
t/compiler/c.t..........ok
t/compiler/qsort.t......ok
t/rx/basic.t............ok
t/rx/call.t.............ok
t/rx/special.t..........ok
All tests successful, 2 subtests skipped.
Files=18, Tests=84, 3 wallclock secs ( 0.79 cusr + 1.03 csys = 1.82 CPU)
WHOA! Somehow you got a different number of results than tests ran!
This should never happen! Please contact the author immediately!
END failed--call queue aborted.


=====
James Michael DuPont
http://introspector.sourceforge.net/


Dan Sugalski

Jan 6, 2003, 3:14:33 PM1/6/03
to Leopold Toetsch, P6I
At 1:30 PM +0100 1/6/03, Leopold Toetsch wrote:
>Attached test program shows some additional effects of PMC size on
>timing. A PMC is 32 bytes, an SPMC is 16 bytes, matching current and
>minimal PMC sizes for i386 (or a typical 32-bit system).
>
>1) For linear access, half-sized PMCs give double the mark speed. This
>agrees well with stress.pasm

Not at all surprising, as it doubles the cache density. OTOH, this
also argues for moving the mark stuff out of the PMC struct entirely
if we can manage it. A region of the PMC pool that's entirely mark
area would up the cache density by a factor of three or four.

>2) As we have only one free_list per pool, memory usage will get more
>and more random, when e.g. parts of arrays are rewritten with new
>values, which come from the free_list. The worst case behaviour will be
>total random access to our pools data.

This is a very good point. It might be worth having an "allocate X
PMCs into buffer Y" call that could potentially let us do more
optimization here, since if we're allocating a full pool's worth of
PMCs it may be more prudent to just allocate a full new pool and
stuff them all into the single array.

>For PMCs, the worst case random access takes about double the time of
>a linear access. SPMCs take almost 3 times, but are still faster than
>PMCs. But the advantage of smaller PMCs is almost gone.
>
>3) Conclusion
>To keep PMCs more tightly together when reusing them, these numbers
>seem to indicate, that we might need a free_list per pool->arena,
>which would need a pool->arena pointer in the PMC.

We could avoid this with properly aligned PMC pools, if we don't mind
playing pointer masking games. Whether this would be faster is an
open question, though.
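The pointer-masking game could look like this sketch (using C11
aligned_alloc as a stand-in for whatever aligned allocator Parrot ends
up with; names and sizes are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE (1u << 16)                 /* 64 KB, a power of two */
#define POOL_MASK (~(uintptr_t)(POOL_SIZE - 1))

typedef struct Pool {
    unsigned char flags[POOL_SIZE / 32];     /* 1 byte per 32-byte PMC */
    /* ... PMC storage follows the header ... */
} Pool;

/* With pools allocated on POOL_SIZE boundaries, any pointer into a
   pool maps back to its header with a single AND. */
static Pool *pool_of(void *obj) {
    return (Pool *)((uintptr_t)obj & POOL_MASK);
}

static Pool *pool_new(void) {
    /* aligned_alloc requires size to be a multiple of the alignment */
    return (Pool *)aligned_alloc(POOL_SIZE, POOL_SIZE);
}
```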

>4) With a pool->arena pointer in the PMC, we could also try to use a
>separate flags field, which could be 1 byte per PMC sized.

I don't think a single byte's enough for PMC flags, though it could
be for internal use. (Heck, two bits may be enough, in which case it
may be worth having a separate mark bitstring for live/dead marks)
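A packed mark bitstring for one pool could be as small as this sketch
suggests (sizes and names are made up; one bit per object rather than
two, for simplicity):

```c
#include <stdint.h>
#include <string.h>

/* One mark bit per object, packed 8 per byte: for a pool of 2048
   32-byte PMCs (64 KB of headers), the whole mark area is 256 bytes,
   i.e. a handful of cache lines. */
typedef struct MarkBits {
    uint8_t bits[2048 / 8];
} MarkBits;

static void mark_set(MarkBits *m, unsigned i) {
    m->bits[i >> 3] |= (uint8_t)(1u << (i & 7));
}
static int mark_test(const MarkBits *m, unsigned i) {
    return (m->bits[i >> 3] >> (i & 7)) & 1;
}
static void mark_clear_all(MarkBits *m) {
    memset(m->bits, 0, sizeof m->bits);   /* start of a DOD run */
}
```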

>5) On my system pools of different sizes are in totally different
>ranges of memory (sbrk vs. mmap). This makes the quick bit mask test
>in trace_mem_block totally unusable.

Yeah, but I'm not sure that is a problem as such--we're going to have
to deal with multiple memory blocks anyway, so...
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

James Michael Dupont

Jan 6, 2003, 4:56:09 PM1/6/03
to Dan Sugalski, Leopold Toetsch, P6I
Dear fellow hackers,

the "C" disassembler is crashing:

mdupont@introspector:~/development/parrot/parrot$ ./disassemble
t/pmc/sub_1.pbc
disassemble: debug.c:1055: PDB_disassemble_op: Assertion `size + 20 <
space' failed.
Aborted


And the Perl disassembler works!

I have been looking into it, and it seems that Parrot contains very
little metadata compared to IL/DotGNU.

The table at the top of the program contains many strings, pmc_* and
sometimes other slots. But I wonder: will you try to represent OO
classes the way IL does, with its .class attribute?

best regards,

mike

Leopold Toetsch

Jan 6, 2003, 5:01:09 PM1/6/03
to Dan Sugalski, P6I
Dan Sugalski wrote:

> At 1:30 PM +0100 1/6/03, Leopold Toetsch wrote:

>> 1) for linear access half sized PMCs give double mark speed. This is
>> in good relation to stress.pasm

> ... A region of the PMC pool that's entirely mark area would
> up the cache density by a factor of three or four.


s/three or four/32/ for the current PMC. next_for_gc is not really used
often enough, IMHO, to justify keeping it in the same place.

Marking current PObj headers needs these flags: live, on_free_list,
is_PMC, is_PMC_ptr, is_Buffer_ptr, active_destroy (and possibly
is_SPMC, i.e. scalar or small PMC). This would fit fine in one byte.
Only when we have a more complicated structure (with data, properties,
whatever) do we need next_for_gc.

>> ... The worst case behaviour will be
>> total random access to our pools data.

> This is a very good point. It might be worth having an "allocate X PMCs
> into buffer Y" call that could potentially let us do more optimization
> here, since if we're allocating a full pool's worth of PMCs it may be
> more prudent to just allocate a full new pool and stuff them all into
> the single array.


Yes. But will a HL be able to emit such a call? This would need
thorough loop analysis, or some help from the HL user declaring e.g.
the array size in advance, which is rather unperlish. But OTOH the docs
could state: if you know you'll need a really big array, say so to
perl6, and you'll get more speed.

>> ... we might need a free_list per pool->arena,
>> which would need a pool->arena pointer in the PMC.

> We could avoid this with properly aligned PMC pools, if we don't mind
> playing pointer masking games. Whether this would be faster is an open
> question, though.


Aligned PMC pools don't help in the above case of long-running
programs: when newly created PMCs come from different pool->arena
areas, they will be scattered all over the allocated regions.

The idea above is to have a free_list per header_pool->arena, which
would keep items more tightly together than our free_list per header
pool does.

Or: at some point it would be cheaper to forget freed PMCs and allocate
fresh new pools, cache-friendly and tightly packed.


>> 4) With a pool->arena pointer in the PMC, we could also try to use a
>> separate flags field, which could be 1 byte per PMC sized.

> I don't think a single byte's enough for PMC flags, though it could be
> for internal use. (Heck, two bits may be enough, in which case it may be
> worth having a separate mark bitstring for live/dead marks)


See above, and pobject_lives() in dod.c. We need 1 byte per header for
marking, not for all PObj flags; those will still exist.
The bit arithmetic will be faster for big programs ("big" will be
defined later ;-)


>> 5) On my system pools of different sizes are in totally different
>> ranges of memory (sbrk vs. mmap). This makes the quick bit mask test
>> in trace_mem_block totally unusable.

> Yeah, but I'm not sure that is a problem as such--we're going to have to
> deal with multiple memory blocks anyway, so...


Or skip this altogether ;-)


leo

Leopold Toetsch

Jan 7, 2003, 5:05:36 AM1/7/03
to mcha...@vendian.org, P6I
Mitchell N Charity wrote:


> The attached patch adds a scheme where:
> - gc flags are in the pool, and
> - pmc->pool mapping is done with aligned pools and pmc pointer masking.


Thanks for that.


> Observations:
> - It's fast. (The _test_ is anyway.) Perhaps 4x random, 10x+ linear.


I see ~8x/~12x with pmc->pool calculation on my Athlon with -O3.


> - PMC size doesn't matter here, because we never actually touch them.


Yep, for marking. Putting the PMC on the free_list would then still be
cheaper for smaller PMCs, I think.


> Mitchell
> (I assume at some point Dan will say "ok, we've demonstrated the
> minimally required gc performace... so now let's get back to actually
> writing the thing...". That will quiet my currently fluttering
> premature optimization warning flags. ;)

I think we should have some schemes in parallel. I have here my "small
PMC" patch (though w/o morphing or such); separated flags shouldn't be
too hard, thanks to the flag accessor macros.

Do we already have a general purpose memalign() function and a config
test for it? Or better, was there one in the past?

leo

Leopold Toetsch

Jan 8, 2003, 9:00:38 AM1/8/03
to mcha...@vendian.org, P6I, Dan Sugalski
Mitchell N Charity wrote:

> The attached patch adds a scheme where:
> - gc flags are in the pool, and
> - pmc->pool mapping is done with aligned pools and pmc pointer masking.
>

> Observations:
> - It's fast. (The _test_ is anyway.)


I did try it, and some more, in realiter.

Summary: it's slower :-(
Calculating the flags position in the pool in pobject_lives() and
free_unused_pobjects() takes more time than the smaller cache footprint
gains. Two reasons: positions have to be calculated twice, and the
cache is more stressed with other things, IMHO.

All that seems to remain is: smaller PMCs for scalars.

leo

Mitchell N Charity

Jan 8, 2003, 1:48:03 PM1/8/03
to Leopold Toetsch, P6I, Dan Sugalski
> Summary: it's slower :-(

:(

> Calculating the flags position in the pool in pobject_lives() and
> free_unused_pobjects() takes more time than the smaller cache
> footprint gains. Two reasons: positions have to be calculated twice,
> and the cache is more stressed with other things, IMHO.

Hmm... the first reason, a second bit of pointer arithmetic, seems
surprising, cycles being sooo much cheaper than cache misses. So I
modified the tpmc test with a second calc. Plus two extra function
calls to make sure it wasn't optimized away (to a separately compiled
file and back). The two real test cases (linear flag-only walk, and
random PMC->flag) were fine (unchanged and perhaps 1/3 slower), though
the fast toy case of linear PMC->flag was 5x slower (still faster than
the equivalents). So it's not the first reason.

That leaves the cache being stressed by other things.
Do we have any candidates?

I'd expect some interference effects between flag arrays, given _lots_
of arrays and random access. I'm not sure the stressX benchmarks are
"lots" enough. But while this interference might be worse in reality
than in the test program, it should still be much less than for
touching PMCs (say by 10x). So that doesn't seem a likely candidate.

Is the gc run doing anything memory intensive aside from the flag fiddling?
I don't suppose it is still touching the PMC bodies for any reason?

Puzzled,
Mitchell
("[..] in realiter"?)



Dan Sugalski

Jan 8, 2003, 1:53:14 PM1/8/03
to mcha...@vendian.org, Leopold Toetsch, P6I
At 6:15 PM -0500 1/6/03, Mitchell N Charity wrote:
>+ pool_pmc[i] = memalign(ALIGN, SIZE*sizeof(PMC));

This is the only problem--memalign's not universal unless we build
with the malloc we provide.

Have we looked into whether we can mix this malloc with the current
memory allocation system?

Leopold Toetsch

Jan 8, 2003, 6:45:00 PM1/8/03
to mcha...@vendian.org, P6I, Dan Sugalski
Mitchell N Charity wrote:

> Summary: it's slower :-(
>
> :(


Yep


> Calculating the flags position in the pool in pobject_lives() and
> free_unused_pobjects() takes more time than the smaller cache
> footprint gains. Two reasons: positions have to be calculated twice,
> and the cache is more stressed with other things, IMHO.
>
> Hmm... the first reason, a second bit of pointer arithmetic, seems
> surprising, cycles being sooo much cheaper than cache misses.


Here are the relevant bits:

#  define pool(o) \
       ((struct Small_Object_Arena *)(PTR2UINTVAL(o) & POOL_MASK))
#  define DOD_FLAGS(o) \
       ((POOL_FLAG_TYPE *)pool(o)->flags) \
           [((char *)(o) - (char *)(pool(o)->start_objects)) \
               / pool(o)->object_size]

(object_size is copied from pool, not currently there)

This is a general version that plugs in as a replacement for
PObj_get_FLAGS(o), but it was called only once per function. I think
the real problems are not the cycles of pointer arithmetic; there are
different problems:
- We can't use explicit pool pointers; handling flags directly is
faster (getting the pool pointer has the same cache impact).
- When there are no explicit pool pointers, something like the above
has to calculate the pool position, which needs a fixed-size POOL_MASK,
i.e. a fixed pool size.
- With fixed-size pools (buffers & PMCs all alike), a List, List_chunk,
Hash, String and so on all get the same pool size, though they may be
used just once, leading to huge buffer and buffer-like pools too.
- E.g. stress.pasm needs 500K PMCs; fastest is to grow pools to some
megabytes of memory, or finally ~200,000 PMCs per pool->arena.
- E.g. life.pasm needs only ~50 strings per cycle, but needs really
fast recycling of these, so the pool size should not be much bigger
than the demand (which holds for all programs).
- With fixed-size pools, I see no possibility to deal with these two
extreme demands.
- I also tried not add_free-ing all objects immediately and reducing
arena->used, so that free_unused_pobjects() is faster, but this needs a
DOD run first. We don't know in which header_pool the shortage is. And
when one pool holds ~10^6 objects and the other pools ~nothing, a DOD
run for allocating more for the rarely used pool is too expensive.

stress.pasm with fixed-size pools spends its time in
free_unused_pobjects(), because there are too many dead (or better,
never alive) objects around.

> ... So I modified the tpmc test with a second calc.


The test is for one fixed-size pool with one header kind. We have pools
for objects of sizeof Buffer, List_chunk, List, Hash, PMC and probably
more, which may have very different header counts, from 0 to 1e6 or
more. All have to be somehow equally fast. We can trade a little bit to
favor one kind of header, but not all.
We can't allocate a fixed-size huge pool arena for the worst case; all
the others, and memory consumption, would suffer.


> I don't suppose it is still touching the PMC bodies for any reason?


No. But wading through the root set, umpteen header pools, marking
stacks and so on needs cache space.


> Puzzled,


So was I. Tests looked really fine.

BTW If you (or anyone) wants a patch just mail me


> Mitchell

leo
