First, the resource system falls down hard when doing lots of
allocations with no deallocations. It gets exponentially slower,
which is a Bad Thing. While I'm not sure that there's a whole lot to
be done about this in the grand scheme, the current system can
certainly use some feedback mechanisms to deal with the problem for
reasonable (<100K live objects) data set sizes.
Second, Array and its subclasses need some performance thumping, as
they seem to come in at about half the speed of perl 5. I'm not sure
where the speed penalty comes in, whether it's in the class code or
in the design of the access mechanisms, but I want to find out why,
and fix it *now*. If it's a design problem, I want to know so we can
redesign before too much is dependent on the current scheme; if it's
an implementation tuning issue, it can be dealt with by anyone with
some time and an inclination to abuse the code.
I'll check in the stress programs and their perl 5 counterparts so we
can have some sort of feel for time issues on this.
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
> First, the resource system falls down hard when doing lots of
> allocations with no deallocations. It gets exponentially slower, which
> is a Bad Thing.
Are you sure that you didn't start swapping?
I have here a test that pushes X integers into a PerlArray, then clones
it, then does a sweep:
1e5
build 0.181974 seconds.
clone 0.056985 seconds.
sweep 0.082553 seconds.
5e5
build 0.834865 seconds.
clone 0.277557 seconds.
sweep 0.389210 seconds.
1e6 first time
build 1.305464 seconds.
clone 11.202339 seconds.
sweep 30.276119 seconds.
1e6 3rd time
build 1.293441 seconds.
clone 1.267864 seconds.
sweep 0.976740 seconds.
I see a linear increase, apart from the odd slowdown in the first 1e6
test, where it seems that swapping kicks in the first time through
(i386/linux, 800 MHz Athlon, 256 MB RAM).
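Roughly, the test does this (a sketch only, not the exact script; the
opcode spellings are from memory and the per-phase timing printout is
trimmed to the build phase):

    set     I1, 1000000         # element count (the 1e6 case)
    new     P0, .PerlArray
    time    N0
    set     I0, 0
build:
    set     P0[I0], I0          # build: store integers into the array
    inc     I0
    lt      I0, I1, build
    time    N1
    sub     N2, N1, N0
    print   "build "
    print   N2
    print   " seconds.\n"
    clone   P1, P0              # clone (timed the same way)
    sweep                       # explicit DOD run (timed the same way)
    end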
> Second, Array and its subclasses need some performance thumping, as they
> seem to come in at about half the speed of perl 5.
Ha, you didn't compare it before my list checkin/array rewrite ;-)
> I'll check in the stress programs and their perl 5 counterparts so we
> can have some sort of feel for time issues on this.
When those are in, I'll have a look at it. There are some things in
list.c that can be optimized. Did you compare mem usage of:
new P0, .PerlArray
set P0[1000000], 42
with perl5 too :-)
leo
Oh, absolutely. One of the joys of a laptop: it's pretty obvious when
the hard drive spins up. examples/benchmarks/stress.pasm sort of
shows the problem. I managed to wipe an earlier version, but it was
allocating arrays of arrays of 10K elements. It starts getting nasty
reasonably quickly. I checked in examples/benchmarks/stress3.pasm to
show what's going on. Interestingly, it triggers between 1 and 2
sweeps per 10K elements allocated, which is odd as the interpinfo
doesn't seem to show that we're actually out of anything. Either the
counters are wrong or we're triggering DODs when we're not out of
things, but whichever, we need to fix something.
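The wiped version did roughly this (reconstructed from memory, so the
counts and spelling are approximate -- this is not the checked-in
file):

    new     P1, .PerlArray          # outer array holds everything
    set     I2, 0
outer:
    new     P0, .PerlArray          # a fresh inner array each pass
    set     I0, 0
inner:
    set     P0[I0], I0              # fill it with 10K elements
    inc     I0
    lt      I0, 10000, inner
    set     P1[I2], P0              # keep it live -- nothing is freed
    inc     I2
    lt      I2, 1000, outer
    end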
The ever-increasing (by a factor of 4, it looks like) allocation size
needs a cap as well--after a certain point, increasing the header
allocation size is a bit much.
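Something along these lines in the pool-growth code is what I'm
thinking of (a sketch only -- the names and numbers are illustrative,
not the actual resources code):

    /* Sketch only: illustrative names and numbers, not the actual
     * resources code.  Grow the per-replenish header count
     * geometrically (currently by 4, maybe something gentler), but
     * stop growing once it hits a hard cap. */
    #include <stddef.h>

    #define HEADER_GROWTH_FACTOR    4
    #define MAX_HEADERS_PER_ALLOC   65536

    static void
    grow_headers_per_alloc(size_t *headers_per_alloc)
    {
        size_t next = *headers_per_alloc * HEADER_GROWTH_FACTOR;

        if (next > MAX_HEADERS_PER_ALLOC)
            next = MAX_HEADERS_PER_ALLOC;
        *headers_per_alloc = next;
    }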
>I see a linear increase, apart from the odd slowdown in the first 1e6
>test, where it seems that swapping kicks in the first time through
>(i386/linux, 800 MHz Athlon, 256 MB RAM).
>
>>
>>Second, Array and its subclasses need some performance thumping, as
>>they seem to come in at about half the speed of perl 5.
>
>
>Ha, you didn't compare it before my list checkin/array rewrite ;-)
Oh, I saw them, and they were pathetic. Much less pathetic now, which
is darned good. :)
Still, we can (and have to) do better. I figure it's a good place for
someone who's looking to get into the source to fiddle with. It's
reasonably small and self-contained.
>>I'll check in the stress programs and their perl 5 counterparts so
>>we can have some sort of feel for time issues on this.
>
>When those are in, I'll have a look at it. There are some things in
>list.c that can be optimized. Did you compare mem usage of:
>
> new P0, .PerlArray
> set P0[1000000], 42
>
>with perl5 too :-)
Check out the stress tests, but not quite that extreme. :)
> At 12:52 AM +0100 1/3/03, Leopold Toetsch wrote:
>
>> Dan Sugalski wrote:
>>
>>> First, the resource system falls down hard when doing lots of
>>> allocations with no deallocations. It gets exponentially slower,
>>> which is a Bad Thing.
1) parrot compiled -O3 (classes/Makefile is br0ken, doesn't get CFLAGS
from Configure) - but unoptimized is still faster than perl5 :-)
2) I fine-tuned block allocation, which was causing DOD runs too often.
Memory blocks are growing now too.
3) Changed the growing factor for PMCs from 4 to 1.5 and started with a
bigger initial size. Growing too aggressively causes far too many PMCs
to be allocated for programs with many live objects. I also introduced
an (arbitrary) upper limit for pool size.
>>> Second, Array and its subclasses need some performance thumping, as
>>> they seem to come in at about half the speed of perl 5.
4) classes/perlarray.pmc was totally unoptimized (cut&paste blabla).
$ time parrot stress.pbc
real 0m0.508s
user 0m0.430s
sys 0m0.080s
$ time parrot -j stress.pbc
real 0m0.420s
user 0m0.350s
sys 0m0.070s
$ time perl examples/benchmarks/stress.pl
real 0m0.643s
user 0m0.610s
sys 0m0.030s
But this is not the original stress.pasm; it is a harder one, more
similar to the perl version (you were reusing P0 for @arr1 and @arr2,
while mine allocates new arrays, leading to more live objects).
perl: 5.8.0 -Duselongdouble (which shouldn't matter here)
i386/linux, athlon 800, 256 Meg
I'll comment the changes a little bit more and
[ ] send patch
[ ] check it in?
leo
> $ time parrot stress.pbc
>
> real 0m0.508s
Sorry, due to a lack of coffee, I ran a version where DOD was
blocked in buildarray.
Here are updated numbers:
$ time perl examples/benchmarks/stress.pl
real 0m0.786s
user 0m0.590s
sys 0m0.200s
$ time parrot -j stress.pbc
real 0m0.849s
user 0m0.780s
sys 0m0.070s
$ time parrot stress.pbc
real 0m0.953s
user 0m0.890s
sys 0m0.050s
The speed difference is due to 12 DOD runs, which take by far the most time:
  %   cumulative   self               self      total
 time    seconds   seconds     calls  us/call   us/call  name
 20.34      0.24      0.24        73  3287.67   3698.63  free_unused_pobjects
 16.95      0.44      0.20   1332629     0.15      0.15  pobject_lives
 14.41      0.61      0.17                               cg_core
 12.71      0.76      0.15       155   967.74   2257.78  list_mark
 11.02      0.89      0.13    500078     0.26      0.56  new_pmc_header
  7.63      0.98      0.09        13  6923.08   6923.08  alloc_objects
  4.24      1.03      0.05    500051     0.10      0.50  list_assign
leo
[ another f'up myself ]
stress2 results:
$ time perl examples/benchmarks/stress2.pl
real 0m2.430s
user 0m2.410s
sys 0m0.020s
$ time parrot stress2.pbc
real 0m1.854s
user 0m1.710s
sys 0m0.140s
$ time parrot -j stress2.pbc
real 0m1.460s
user 0m1.370s
sys 0m0.090s
leo
[ this is Mr. f'up myself again ]
> $ time parrot -j stress.pbc
>
> real 0m0.849s
By reducing the number of DOD runs I now have:
$ time parrot -j stress.pbc
A total of 9 DOD runs were made
real 0m0.708s
But this still could go faster:
$ parrot -j stress.pbc # w/o pmc->synchronize (-10% size)
A total of 9 DOD runs were made
real 0m0.635s
This test is with a 10% smaller PMC giving a 10% increase in speed. This
again indicates the need for a smaller Scalar PMC, as already
proposed in the thread
"[CVS ci] mark5 - PMC/Buffer unification #10".
Having only 50% of the size for a "plain" scalar could mean a 50% speed
increase for programs with many live PMCs.
The problem is that we can't promote a smaller PMC to a bigger one. But
what about this strategy (rough sketch after the list):
- scalars start with a small PMC (16 bytes on a 32-bit machine),
  except when the HL indicates that the var needs properties or may
  be shared between threads
- 99% of all scalars will probably never be used differently
- when such a small scalar gets runtime properties added, then:
- make a new "big" PMC and point the old PMC to the newly created one.
  This should work like an invisible reference, doing all functions on
  behalf of the scalar.
- in the next DOD run, when we find such a PMC pointer, replace it with
  the real PMC. As this might need a modified pobject_lives function
  (digging in all places to find all occurrences), this could be a
  separate step too, run only when needed.
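Very roughly, and just to show the shape -- none of these structs or
names exist in the tree, everything here is invented for illustration:

    /* Sketch only: all names here are invented for illustration. */
    typedef unsigned long UINTVAL;
    typedef struct VTABLE VTABLE;              /* opaque for this sketch */

    typedef union UnionVal {
        long             int_val;
        double           num_val;
        void            *string_val;
        struct SmallPMC *pmc_val;              /* doubles as forwarding ptr */
    } UnionVal;

    typedef struct SmallPMC {                  /* 16 bytes on a 32-bit box */
        UINTVAL   flags;
        VTABLE   *vtable;
        UnionVal  cache;
    } SmallPMC;

    typedef struct BigPMC {                    /* roughly today's full PMC */
        SmallPMC  base;
        void     *synchronize;                 /* thread sync info   */
        void     *metadata;                    /* runtime properties */
    } BigPMC;

    #define PObj_is_forwarded_FLAG  (1UL << 15)    /* invented flag bit */

    /* Adding runtime properties to a small scalar: allocate a BigPMC,
     * copy the scalar into it, and turn the old header into an
     * invisible forwarding reference.  The next DOD run (or a separate
     * fixup pass) rewrites every pointer to the old header so it
     * points at the BigPMC directly. */
    BigPMC *
    promote_to_big(SmallPMC *small, BigPMC *fresh)
    {
        fresh->base = *small;
        small->flags |= PObj_is_forwarded_FLAG;
        small->cache.pmc_val = (SmallPMC *)fresh;  /* forward to big one */
        return fresh;
    }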
leo
Odd. Not on my system. Might be an x86/PPC issue.
>2) I fine-tuned block allocation, which was causing DOD runs too
>often. Memory blocks are growing now too.
>
>3) Changed the growing factor for PMCs from 4 to 1.5 and started with a
>bigger initial size. Growing too aggressively causes far too many
>PMCs to be allocated for programs with many live objects. I also
>introduced an (arbitrary) upper limit for pool size.
Keen, thanks.
>>>> Second, Array and its subclasses need some performance thumping, as
>>>> they seem to come in at about half the speed of perl 5.
>
>4) classes/perlarray.pmc was totally unoptimized (cut&paste blabla).
>
Figured as much. Not at all a problem, just an easy spot to get some
extra performance.
>But this is not the original stress.pasm; it is a harder one, more
>similar to the perl version (you were reusing P0 for @arr1 and
>@arr2, while mine allocates new arrays, leading to more live
>objects).
>
>perl: 5.8.0 -Duselongdouble (which shouldn't matter here)
>i386/linux, athlon 800, 256 Meg
>
>I'll comment the changes a little bit more and
>
>[ ] send patch
>[ ] check it in?
Oh, apply it, by all means. (Though I think I see that you already have... :)
You've a *long* way to go to get that title. (It's mine, and you
can't have it! :)
>>$ time parrot -j stress.pbc
>>
>>real 0m0.849s
>
>By reducing the amount of DOD runs I now have:
>
>$ time parrot -j stress.pbc
>A total of 9 DOD runs were made
>
>real 0m0.708s
>
>But this still could go faster:
>$ parrot -j stress.pbc # w/o pmc->synchronize (-10% size)
>A total of 9 DOD runs were made
>
>real 0m0.635s
>
>This test is with a 10% smaller PMC giving a 10% increase in speed.
>This again indicates the need for a smaller Scalar PMC
>as already proposed in the thread
>"[CVS ci] mark5 - PMC/Buffer unification #10".
>Having only 50% of the size for a "plain" scalar could mean a 50% speed
>increase for programs with many live PMCs.
Damn, I was afraid of that. On the one hand I like the speed, on the
other... :(
>The problem is that we can't promote a smaller PMC to a bigger one. But
>what about this strategy:
>- scalars start with a small PMC (16 bytes on a 32-bit machine),
>  except when the HL indicates that the var needs properties or may
>  be shared between threads
>- 99% of all scalars will probably never be used differently
>- when such a small scalar gets runtime properties added, then:
>- make a new "big" PMC and point the old PMC to the newly created one.
>  This should work like an invisible reference, doing all functions on
>  behalf of the scalar.
>- in the next DOD run, when we find such a PMC pointer, replace it with
>  the real PMC. As this might need a modified pobject_lives function
>  (digging in all places to find all occurrences), this could be a
>  separate step too, run only when needed.
I'm not willing to go so radically to start, but I did have an idea.
I think part of the extra cost is just in cache fluffiness--the sync
info just isn't being used much. I don't think that it, or the
property info, will be used most of the time. We could yank the
metadata and sync fields out of the PMC itself and put them in a
separate segment of the PMC's arena, and replace them with a single
pointer to the PMC's particular section.
It ultimately increases the size of a PMC by one pointer if we
preallocate all the property/sync pairs, but it makes the PMC itself
more dense and, if I'm calculating things right, makes it fit on an
8-byte boundary better.
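In rough C, the shape I have in mind is something like this (made-up
names, nothing here is checked in):

    /* Shape sketch only -- made-up names, nothing here is checked in. */
    #include <stddef.h>

    typedef struct PMCExtra {          /* the rarely-touched bits     */
        void *synchronize;             /* thread synchronization info */
        void *metadata;                /* runtime properties          */
    } PMCExtra;

    typedef struct PMC {
        unsigned long  flags;
        struct VTABLE *vtable;
        void          *data;
        union {                        /* the cache union             */
            long    int_val;
            double  num_val;
            void   *string_val;
        } cache;
        PMCExtra      *extra;          /* one pointer into the arena's
                                          preallocated extra segment  */
    } PMC;

    /* Each arena preallocates a parallel block of PMCExtra structs, so
     * header i's cold data is simply extra_block[i] -- no lookup cost. */
    typedef struct PMCArena {
        PMC      *pmc_block;           /* dense, cache-friendly bodies */
        PMCExtra *extra_block;         /* cold property/sync pairs     */
        size_t    used;
        size_t    total;
    } PMCArena;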
> At 5:46 PM +0100 1/3/03, Leopold Toetsch wrote:
>
>> Leopold Toetsch wrote:
>>
>> [ this is Mr. f'up myself again ]
>
> You've a *long* way to go to get that title. (It's mine, and you can't
> have it! :)
Oh, sorry for misusing your praenomen ;)
>> This test is with a 10% smaller PMC giving 10% increase in speed.
> Damn, I was afraid of that. On the one hand I like the speed, on the
> other... :(
[ differently sized PMCs ]
> I'm not willing to go so radically to start, but I did have an idea. I
> think part of the extra cost is just in cache fluffiness--the sync info
> just isn't being used much. I don't think that it, or the property info,
> will be used most of the time. We could yank the metadata and sync
> fields out of the PMC itself and put them in a separate segment of the
> PMC's arena, and replace them with a single pointer to the PMC's
> particular section.
A little bit more radically (sketch below):
- a PMC has a PObj and a vtable (= 16 bytes)
- if a PMC needs data, metadata or synchronize, then
  cache.pmc_val points to the "Extra PMC data" (our current PMC)
  containing these fields, so we never resize, but have one more
  indirection for aggregates or PMCs with properties
- PerlStrings move from ->data to ->cache.string_val
As programs tend to have many more scalars than aggregates, and
probably more scalars without properties than with, this would win a
lot of precious cache mem.
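I.e. something like this (field names approximate, sketched from
memory; nothing of this exists yet):

    /* Sketch only; field names are approximate. */
    typedef struct PMC {                    /* 16 bytes on a 32-bit box  */
        unsigned long  flags;               /* the PObj part             */
        union {
            long            int_val;        /* PerlInt lives here        */
            double          num_val;        /* PerlNum lives here        */
            void           *string_val;     /* PerlString moves here     */
            struct PMC_EXT *pmc_val;        /* -> extra data when needed */
        } cache;
        struct VTABLE *vtable;
    } PMC;

    typedef struct PMC_EXT {                /* only for aggregates, PMCs */
        void *data;                         /* with properties, shared   */
        void *metadata;                     /* PMCs, ...                 */
        void *synchronize;
    } PMC_EXT;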
> It ultimately increases the size of a PMC by one pointer if we
> preallocate all the property/sync pairs, but it makes the PMC itself
> more dense and, if I'm calculating things right, makes it fit on an
> 8-byte boundary better.
Not here: the current PMC is 32 bytes, a simple PMC 16 bytes.
BTW tt.c prints these numbers ;-)
leo
> At 1:34 PM +0100 1/3/03, Leopold Toetsch wrote:
>> 1) parrot compiled -O3 (classes/Makefile is br0ken, doesn't get CFLAGS
>> from Configure) - but unoptimized is still faster than perl5 :-)
> Odd. Not on my system. Might be a x86/PPC issue.
I must have overlooked -O3 in classes/Makefile; it was there after all.
> Oh, apply it, by all means. (Though I think I see that you already
> have... :)
Only the list-related part.
OK, here is the rest.
- No stackwalk, as already sent as separate patch #19668, which is
  included. If this is a problem for $architecture, please enable
  trace_system_areas in dod.c again - feedback still welcome.
- a hack to reduce DOD runs by some amount, helping for this test,
  where only allocations happen. If we keep it, it should be reworked
  to keep the stats in the interpreter, and "skip" has to be disabled
  for explicit "sweep" calls (rough shape sketched after this list).
- allocation constants give good results on my systems, which might
  not apply to yours. In the long run some self-tuning will be
  necessary; we will definitely need lower allocation sizes when
  e.g. such a big allocation is gone and we have plenty of free
  PMCs - we currently only increase headers_per_alloc.
- statistics need to be reworked; they are only valid after a DOD
  run, though I now update num_free_objects immediately.
- memory_pool increases its allocation size now too
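The "skip" hack in the second item boils down to a check like this
(pseudo-C, not the exact patch; the names are invented):

    /* Rough shape of the "skip" hack -- pseudo-C, not the exact patch.
     * If nothing was freed since the last DOD run, the program is in
     * an allocation-only phase; another mark would find the same live
     * set, so skip it.  Explicit "sweep" ops must bypass the check. */
    struct DODStats {
        unsigned long allocations_since_last_dod;
        unsigned long frees_since_last_dod;
    };

    static int
    skip_dod_run(const struct DODStats *s, int is_explicit_sweep)
    {
        if (is_explicit_sweep)
            return 0;                       /* the user asked for it      */
        if (s->frees_since_last_dod == 0 &&
            s->allocations_since_last_dod > 0)
            return 1;                       /* allocation-only: skip once */
        return 0;
    }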
leo
Fair enough, though there's something just so bizarre about having
the cached value be a required part, and the data the cache is
supposed to be based on optional... :)
>>It ultimately increases the size of a PMC by one pointer if we
>>preallocate all the property/sync pairs, but it makes the PMC
>>itself more dense and, if I'm calculating things right, makes it
>>fit on an 8-byte boundary better.
>
>
>Not here: current PMC 32 byte, simple PMC 16 byte.
Hrm, did the math wrong. Need to go double-check that.
I think we're always going to have to walk the stack, no matter how
much I'd rather not. It's an expensive walk too, alas.
>- a hack to reduce DOD runs by some amount, helping for this test,
> where only allocations happen. If we keep it, it should be reworked
> to have the stats in interpreter and "skip" has to be disabled for
> explict "sweep" calls.
Twiddling the default numbers seems reasonable, though doing it for
this one test is silly.
>- allocation constants give good results on my systems, which might
>  not apply to yours. In the long run some self-tuning will be
>  necessary; we will definitely need lower allocation sizes when
>  e.g. such a big allocation is gone and we have plenty of free
>  PMCs - we currently only increase headers_per_alloc.
Dynamic feedback on the allocation & GC system is definitely in
order. We should find a victim^Wvolunteer to do this--it could be a
very interesting project.
>- statistics need to be reworked; they are only valid after a DOD
>  run, though I now update num_free_objects immediately.
Hrm. We should get at least the allocation numbers right, though I do
wonder at the impact of keeping buffer/pmc allocation counts. (Yeah,
I know, I added that code, as I thought the stats would be useful for
GC feedback)
That's an interesting idea. I'd originally wanted to yank a lot of
stuff out of band and get it in a separate segment of the PMC arena,
to make the commonly used parts of PMCs denser, but I'd given up that
idea because of the cost of figuring out where the extra bits were
held. I'm leaning more and more towards embedding a pointer to the
extra info into the main PMC body and taking the dereference hit on
those occasions when we actually need to get to it from the main PMC
body.
> At 10:25 PM +0100 1/3/03, Leopold Toetsch wrote:
>> As programs tend to have much more scalars then aggregates and
>> probably more scalars w/o properties then with, this would win a lot
>> of precious cache mem.
> Fair enough, though there's something just so bizarre about having the
> cached value be a required part, and the data the cache is supposed to
> be based on optional... :)
"Hysterical raisins" did Dan write.
Integers and floatvals are stored in the union, which happens to be
named "cache"; they don't use data. Renaming cache to val would make it
more obvious. Strings are attached to data, and cache is unused.
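I.e. roughly (simplified; not the literal pobj.h declaration):

    /* Simplified -- not the literal pobj.h declaration. */
    typedef union UnionVal {    /* the PMC member named "cache"        */
        long    int_val;        /* PerlInt stores its value here       */
        double  num_val;        /* PerlNum likewise                    */
        void   *string_val;     /* unused: PerlString hangs its STRING */
        void   *pmc_val;        /*   off ->data instead                */
    } UnionVal;                 /* calling it "val" would say what it is */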
leo
> At 10:08 PM +0100 1/3/03, Leopold Toetsch wrote:
>
>> - No stackwalk,
> I think we're always going to have to walk the stack, no matter how much
> I'd rather not. It's an expensive walk too, alas.
That depends. I think a mixed strategy of:
- code reordering (e.g. $1 = pmc_new_noinit(); pmc->vtable->init()
  instead of pmc = new_pmc(), like is done in setprop)
- passing a **dest_ptr for e.g. string_concat
- disabling DOD (with an optional DOD run before, if the header count
  is below the wanted header count, e.g. for clone)
- and maybe some active anchoring to the root set
will do it. (A small sketch of the first two points follows.)
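Something like this (a sketch, not a patch; the helper name and exact
signatures are approximate and assume the internal headers):

    #include "parrot/parrot.h"   /* assumes the internal headers */

    /* Sketch, not a patch.  Anchor the fresh header in its destination
     * -- a place the DOD already traces -- *before* calling init(),
     * which may allocate and so trigger a DOD run that would otherwise
     * sweep the half-constructed PMC.  No stack walk needed here. */
    static PMC *
    anchored_pmc_new(struct Parrot_Interp *interpreter, INTVAL base_type,
                     PMC **dest)
    {
        PMC *pmc = pmc_new_noinit(interpreter, base_type); /* header only */

        *dest = pmc;                      /* e.g. the target P register   */
        pmc->vtable->init(interpreter, pmc);               /* DOD-safe now */
        return pmc;
    }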
>> - a hack to reduce DOD runs by some amount, helping for this test,
> Twiddling the default numbers seems reasonable, though doing it for this
> one test is silly.
Of course. But the "skip" thingy is different. It tries to detect
whether we are only allocating, and skips the next DOD run in that
case. Instead of twiddling some number we need:
> Dynamic feedback on the allocation & GC system is definitely in order.
> We should find a victim^Wvolunteer to do this--it could be a very
> interesting project.
>
>> - statistics need to be reworked,
> Hrm. We should get at least the allocation numbers right, though I do
> wonder at the impact of keeping buffer/pmc allocation counts. (Yeah, I
> know, I added that code, as I thought the stats would be useful for GC
> feedback)
I added num_free_objects to the fast path with very little impact on
performance. I'll rearrange the current interpinfo interface and move it
to interpreter.c. Numbers like active_PMCs can be calculated on demand.
leo
> Leopold Toetsch wrote:
> $ time parrot -j stress.pbc
> A total of 9 DOD runs were made
>
> real 0m0.708s
>
> But this still could go faster:
> $ parrot -j stress.pbc # w/o pmc->synchronize (-10% size)
> A total of 9 DOD runs were made
>
> real 0m0.635s
$ make -s && time parrot -j stress.pbc
A total of 13 DOD runs were made
real 0m0.378s
user 0m0.310s
sys 0m0.060s
This is with DISABLE_GC_DEBUG and a 16-byte small PMC for PerlInt. So it
actually ~doubles the speed with a half-sized PMC.
If someone wants to further experiment on this issue, I could commit the
changes, with one #define in parrot.h for enabling this feature.
leo
It does seem a lot of time is being spent by free_unused_pobjects
walking arrays of PMCs (pool arenas) to twiddle flags.
A similar alternative might be to yank the GC flags to the head of
their arena. It looks like they might even fit in a short. So an
arena would have an array of flag ints, and a corresponding array of
PMCs. The contiguous flags would speed the walk, and _might_ stick in
cache enough that the cache-miss cost wouldn't just reappear in random
PMC access. One would need a fast way to get from a PMC address to its
arena address, e.g. page-aligned arenas.
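Very hand-wavy sketch (the sizes, the names, and the stand-in PMC
struct are all made up):

    /* Very hand-wavy: sizes, names, and the stand-in PMC are made up. */
    #include <stdint.h>

    #define ARENA_BYTES     65536       /* power of two; arenas allocated
                                           on this boundary, e.g. memalign */
    #define PMCS_PER_ARENA  1024

    typedef struct PMC { void *vtable, *data, *cache; } PMC;  /* stand-in */

    typedef struct Arena {
        uint16_t flags[PMCS_PER_ARENA]; /* contiguous: fast to walk/clear */
        PMC      pmcs[PMCS_PER_ARENA];  /* flag-free PMC bodies           */
    } Arena;

    /* With power-of-two aligned arenas, a PMC's flags are one mask and
     * one subtraction away -- cheap enough for pobject_lives. */
    static uint16_t *
    pmc_flags(PMC *p)
    {
        Arena *a = (Arena *)((uintptr_t)p & ~(uintptr_t)(ARENA_BYTES - 1));
        return &a->flags[p - a->pmcs];
    }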
Random thought,
Mitchell Charity