If I do 32 "save"s in a row, this will certainly be slower than doing a
single "push".
If I do 1 "save", this will (hopefully) be faster than 1 "push".
How many "save"s does it take to be to be slower than one "push"?
(When writing pasm by hand, what's a reasonable cutoff?)
--
I would guess that about 256 million saves is the same as 8 million
pushes, but that really depends on how much space an entry on the
respective stacks takes and how large your virtual address space can be
-- because the only way to make them equivalent is to call enough of
them to run out of memory and crash. :-)
Sorry, I'm being obnoxious. You fell into the same trap as I recently
did. pushx and save are not interchangeable; they operate on
completely different stacks. 'save' pushes an entry of arbitrary type
onto the user stack; pushi, pushs, and friends push onto type-specific
register frame stacks.
Somehow, this needs to be documented better, because it's quite
surprising (to me, at least).
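To make the distinction concrete, here is a rough, untested pasm sketch
(not from the original posts; register numbers and values are arbitrary):

    set I0, 42
    set S0, "greetings"
    save I0         # one typed entry (an int) goes onto the user stack
    save S0         # the same stack takes a string entry right on top
    restore S0      # entries come back newest-first; the type tag must match
    restore I0
    set I5, 7
    pushi           # copies the whole frame of 32 I registers onto the int stack
    set I5, 0       # clobber one of them...
    popi            # ...and popping the frame brings all 32 back: I5 is 7 again
    print I5
    print "\n"
    end

The point being: save/restore and pushi/popi operate on entirely
separate stacks, so one cannot be used to undo the other.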
> The answer to this question varies from platform to platform, and I've
> only got Windows to test on...
>
> If I do 32 "save"s in a row, this will certainly be slower than doing a
> single "push".
>
> If I do 1 "save", this will (hopefully) be faster than 1 "push".
Yep, slightly.
> How many "save"s does it take to be to be slower than one "push"?
This really depends on the architecture, the running core, and so on. But
Dan estimated a cutoff value of 3; this test program indicates a cutoff
of 2:
    set I0, 1000000
    time N0
lp:
    pushp           # or save P0, ...
    popp            # or restore P0, ...
    dec I0
    if I0, lp
    time N1
    sub N1, N0
    print N1
    print " s\n"
    end
Loop only                 0.02 s   (0.002 s with -j)
1 save_p + 1 restore_p    0.2  s
2 save_p + 2 restore_p    0.4  s
3 save_p + 3 restore_p    0.6  s
1 pushp  + 1 popp         0.38 s
All run with the CGP core (-P switch), which is fastest here because
pushX/save/restore are not JITed.
Athlon 800, i386/linux, non-optimized compile.
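(For anyone who wants to rerun this: the snippet should work saved as
a plain .pasm file, say bench.pasm (the name is just an example), and
run as something like "parrot -P bench.pasm"; -P selects the CGP core
and -j the JIT core, as above.)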
leo
Regarding memory:
Entries on the type-specific stacks consume no more memory than the
registers themselves.
Entries on the general stack (where save/restore go) also have, in
addition to the saved register, an integer indicating the type of
register that's been saved, to prevent pushing an int and popping a
string.
Thus, 16 int save ops consume as much memory as one pushi, even
though the pushi is saving 32 int registers: each save entry is a
value plus a type tag, so 16 of them add up to roughly 32
register-sized slots.
Regarding speed:
Each save is a (non-jit, non-inlined) function call.
Each push is a (non-jit, non-inlined) function call.
Even if there is *little* overhead for function calls, it's nonzero,
and it adds up.
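To put "it adds up" in concrete terms (again just a sketch, not from
the original mails): saving an entire int frame by hand costs one call
per register, while pushi does the whole frame in one:

    save I0         # each save is its own op dispatch / function call
    save I1
    save I2
    # ... 29 more saves to cover I3 through I31 ...
    pushi           # one call copies all 32 I registers at once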
> Sorry, I'm being obnoxious. You fell into the same trap as I recently
> did. pushx and save are not interchangeable; they operate on
> completely different stacks. 'save' pushes an entry of arbitrary type
> onto the user stack; pushi, pushs, and friends push onto type-specific
> register frame stacks.
>
> Somehow, this needs to be documented better, because it's quite
> surprising (to me, at least).
Actually, I did realize that they go on completely different stacks.
What I didn't know (though I do now, thanks to Leopold's post) was their
relative speeds.
For me, one push cost about the same as 2.37 saves.
Pushes also have the advantage of, on some architectures (like SPARC),
not polluting the cache. Doing the saves dirties your L1 & L2 caches
for both the source and destination, while the push doesn't, though
the registers are likely already in L2, if not L1, cache. SPARC has a
cache-bypassing memcpy, which is kind of cool. While you might think
bypassing the cache is a bad thing, it actually isn't: the bulk copy
skips the cache precisely so it won't evict data you'll want again.
--
Dan