I'm coding a routine and I'm using all of the registers.
I need to do a division which means I am going to
have to free up EAX and EDX. Is it possible for me
to simply MOVD them into the MMX registers without
causing any problems? I am not doing any floating
point math nor MMX operations in my routine.
But I recall that MMX registers have to be dealt
with carefully.
Thanks.
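For context on why the division ties up both registers: the 32-bit unsigned DIV takes its 64-bit dividend in EDX:EAX and writes the quotient to EAX and the remainder to EDX. A rough C model of that behaviour is below. The function name `div32_model` is made up for illustration, and the asserts only stand in for the divide-error fault the real instruction raises.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of 32-bit unsigned DIV r/m32.
 * Inputs:  *edx:*eax hold the 64-bit dividend (EDX = high half).
 * Outputs: *eax = quotient, *edx = remainder.
 * Both registers are overwritten, which is why they must be free. */
static void div32_model(uint32_t *eax, uint32_t *edx, uint32_t divisor)
{
    uint64_t dividend = ((uint64_t)*edx << 32) | *eax;

    assert(divisor != 0);                      /* real DIV faults (#DE) here */
    assert(dividend / divisor <= UINT32_MAX);  /* ...and on quotient overflow */

    *eax = (uint32_t)(dividend / divisor);
    *edx = (uint32_t)(dividend % divisor);
}
```

With EAX = 100, EDX = 0 and a divisor of 7, the model leaves EAX = 14 and EDX = 2, so whatever was previously in either register is lost unless it was saved somewhere first.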
"A" <questi...@MUNGED.microcosmotalk.com> wrote in message
news:4b01dadb$0$5083$9a6e...@unlimited.newshosting.com...
you can, but you can also use SSE registers for this.
the main difference is that SSE registers don't require 'emms' and
similar...
> Thanks.
MMX is fine as temp storage, but the time taken to MOVD things back &
forth is probably very similar to saving them in a L1-cached stack
location...
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
> MMX is fine as temp storage, but the time taken to MOVD things back &
> forth is probably very similar to saving them in a L1-cached stack
> location...
That doesn't make any intuitive sense. Register-to-register
transfer is supposed to be very fast. Accessing the stack
is inherently much slower. Can you prove that it isn't?
"A" <questi...@MUNGED.microcosmotalk.com> wrote in message
news:4b040ebe$0$5121$9a6e...@unlimited.newshosting.com...
well, something can be noted here:
in the case of the stack, it is very likely to be located in the L1 cache;
the L1 cache is a small amount of very fast memory typically integrated
directly into the processor;
very often (depends on processor), the L1 cache is built from fast on-die SRAM
much like the register file, and so accessing the top of stack, if it is
cached, is nearly as fast as accessing registers.
accessing main memory is slow, but there are both the L2 and L1 caches much
closer than this.
main memory is off on the bus, and is large, but comparatively not very
fast;
the L2 cache is faster, and is often integrated directly into the processor
(as a small amount of on-chip high-speed RAM), but in some setups (such as
Slot A AFAIK, and in many older computers) the L2 cache was often on
external chips closely connected to the processor.
the L1 cache is faster still, but is fairly small (measured in small amounts
of KiB), and for things here, access is often almost as fast as registers...
http://en.wikipedia.org/wiki/L1_cache
this is partly why I also hold the position that SysV is not well designed:
its inherent complexity likely nullifies the benefits of passing arguments
in registers, since in the vast majority of cases these arguments will need
to be spilled anyways, and the design of the calling convention forces the
compiler to go through contortions (these themselves likely reducing
performance...).
similarly, it can be noted that good old 32-bit x86 often beats out 64-bit
code in terms of raw performance, even though good old cdecl tends to pass
everything on the stack (and as well as the notable register shortage, ...).
as I see it, the MS Win64 calling convention is actually better designed,
since it uses registers, but also makes good provisions for spilling, and
does things in a way that largely avoids needing to go through
contortions...
(granted, a sufficiently intelligent compiler could work around a lot of
these added complexities, but bleh...).
another thing likely to help in the Win64 case is that local calls (within
the same library) are linked directly, whereas officially for SysV,
calls/global vars/... are supposed to go through the GOT (ELF64 also
assumes that the situation is "always" 64 bits, whereas Win64 is more
able to make use of the +-2GB window, ...).
given that x86-64 is position-independent by default, the need for the GOT
is largely optional except for external linkage (AKA: in the same places a
DLL would use the IAT). however, there is little provision made for knowing
the difference between local and global code/data (with DLL's, this is
explicit), hence it is not known until link time whether a function is local
or global, and to simply assume it local is, in effect, a violation of the
ABI.
hence, even if a little cruftier, IMO the Win64 ABI is, in general, better
designed...
the main weak point then, is MSVC's relatively stupid optimizer...
but, alas, there is not much that can really change anything, since in
Linux-land things are fairly well established, and people can sit around and
live in the delusion that ELF64 and SysV are "better" than PE32+ and Win64
if they want...
sadly, at this point there is no real way to "fairly" handle this problem
with benchmarks, and in a practical sense, the differences are likely to be
small anyways.
> > That doesn't make any intuitive sense. Register-to-register
> > transfer is supposed to be very fast. Accessing the stack
> > is inherently much slower. Can you prove that it isn't?
>
> well, something can be noted here:
> in the case of the stack, it is very likely to be located in the L1 cache;
Yes I know about L1, but a cache miss could incur a huge delay.
At any rate, I have found some numbers and these show PUSH/POP without
a cache miss should be faster than using XMM registers but not MMX.
See here:
http://www.agner.org/optimize/instruction_tables.pdf
Can you prove that it *is*? Try it and see. I suspect there are more
pathway issues than questions over simple reads/write performance on
many CPUs so results may vary but the top of the stack should be
single-cycle access on modern hardware.
James
"A" <questi...@MUNGED.microcosmotalk.com> wrote in message
news:4b044a91$0$4980$9a6e...@unlimited.newshosting.com...
>
> On Nov 18, 11:59 am, "BGB / cr88192"
> <cr88...@MUNGED.microcosmotalk.com> wrote:
>
>> > That doesn't make any intuitive sense. Register-to-register
>> > transfer is supposed to be very fast. Accessing the stack
>> > is inherently much slower. Can you prove that it isn't?
>>
>> well, something can be noted here:
>> in the case of the stack, it is very likely to be located in the L1
>> cache;
>
> Yes I know about L1, but a cache miss could incur a huge delay.
>
granted, however, this is much less likely near the top of the stack...
> At any rate, I have found some numbers and these show PUSH/POP without
> a cache miss should be faster than using XMM registers but not MMX.
> See here:
>
> http://www.agner.org/optimize/instruction_tables.pdf
>
well, it all depends...
the main cost of MMX is that it doesn't play well with the FPU.
if one can live with this, it should not be a problem...
keep in mind though that for non-leaf functions, it probably will still be
needed to save things to the stack (MMX state can't be retained across
function calls, ...).
Only once, and near the stack top can be almost always assumed
to be in L1 (hint, any call or ret will put it there) unless
flushed by block-ops.
> At any rate, I have found some numbers and these show
> PUSH/POP without a cache miss should be faster than using
> XMM registers but not MMX. See here:
>
> http://www.agner.org/optimize/instruction_tables.pdf
Always this "should be". Why? Just try it and measure!
I have, and using near stack almost always incurs ZERO
penalty vs reg-reg. Your Re-Order Buffer at work.
In degenerate cases reg-reg takes 0.4 clocks (near the
theoretical 1/3) while reg-L1 takes 1.0. But this is only
for packed (similar instruction) cases that heavily load the
busses. A 50/50 mix is 0.5 and 70/30 is back to 0.4. You can
definitely slip in some stack access (lesser used variables)
without penalty.
I would not be surprised if there were cases where reg-L1
measured _faster_ than reg-reg. Strange things go on
inside the register renamer and re-order buffer.
The Intel Atom (in-order, like the i486) is more amenable
to simple clock-counting.
-- Robert
I decided to try it myself with varying numbers of the following three
separate tests
1. mov eax, ebx
2. mov eax, [esp]
In each case the AMD cpu executed three per cycle. Having made stack
space available I also tried
3. mov [esp], eax
Same result - three per cycle. As I mentioned this is probably more to
do with architectural matters. At any rate, on this CPU don't worry
about the top of the stack being slow. It isn't. Actual performance
will depend on CPU and instruction mix. Try it on your own CPU but
your results will likely be similar.
James
> I decided to try it myself with varying numbers of the following
I just did so as well and MMX registers, which
are my main interest, are faster than the stack.
movd mm0, eax
movd mm1, ebx
movd mm2, edx
movd mm3, esi
movd mm4, edi
movd eax, mm0
movd ebx, mm1
movd edx, mm2
movd esi, mm3
movd edi, mm4
runs in only 81% of the time that this stack-based code does:
push eax
push ebx
push edx
push esi
push edi
pop edi
pop esi
pop edx
pop ebx
pop eax
I also tried an experiment of doing some math
operations and move operations when 6 variables
are stored on the stack (L1), which I commonly
see when I examine compiler-generated code,
versus when they are all in registers, and
the registers-only approach ran in 2/3 the time.
This is on a 1.6 GHz Core 2 Duo.
On my P4 HT, the stack code runs about 7.5% *faster* than the MMX
code, so as was pointed out it all depends on the CPU and other
factors.
Richard.
Quite possibly true. push/pop are _not_ speed optimized
(they are for size, and that matters for execute-once code)
due to the r-m-w on esp.
Try explicit addressing:
mov eax, [esp+8] ; presuming this is a reserved local
...
-- Robert
On Nov 19, 2:34 pm, Robert Redelmeier <red...@ev1.net.invalid> wrote:
>
> Quite possibly true. push/pop are _not_ speed optimized
> (they are for size, and that matters for execute-once code)
> due to the r-m-w on esp .
I don't think that's correct. Even on the 586, push and pop are
pairable, right? And the PPro/PII did register renaming for the stack
pointer, I think. (Only "push [mem]" is slower on 486 yet ironically
faster on newer AMD chips.)
"A" <questi...@MUNGED.microcosmotalk.com> wrote in message
news:4b0580e4$0$5111$9a6e...@unlimited.newshosting.com...
push and pop need extra micro-ops and similar...
push eax --> sub esp, 4; mov [esp], eax
mov may well be the same or faster.
for example:
mov [esp+12], eax
mov [esp+8], ecx
mov [esp+4], edx
mov [esp+0], ebx
mov eax, [esp+12]
mov ecx, [esp+8]
mov edx, [esp+4]
mov ebx, [esp+0]
...
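The equivalence above can be checked with a toy model in C. This is only a sketch: `esp` is an index into a small array standing in for stack memory, and the names (`push_model`, `demo_push`, `demo_mov`, `same_result`) are invented here. It assumes the explicit-mov version first reserves its four slots with a single stack-pointer adjustment, which the assembly above leaves implicit.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 16

/* Toy machine state: a word-addressed stack that grows downward. */
struct machine {
    uint32_t mem[STACK_WORDS];
    unsigned esp;               /* index of the current top-of-stack word */
};

/* push r  ==  sub esp, 4 ; mov [esp], r   (one word = 4 bytes here) */
static void push_model(struct machine *m, uint32_t value)
{
    m->esp -= 1;
    m->mem[m->esp] = value;
}

/* Four pushes: ESP is updated four times, each depending on the last. */
static void demo_push(struct machine *m, uint32_t a, uint32_t c,
                      uint32_t d, uint32_t b)
{
    push_model(m, a);
    push_model(m, c);
    push_model(m, d);
    push_model(m, b);
}

/* One ESP adjustment, then stores at fixed offsets:
 *   mov [esp+12], eax / mov [esp+8], ecx / mov [esp+4], edx / mov [esp+0], ebx */
static void demo_mov(struct machine *m, uint32_t a, uint32_t c,
                     uint32_t d, uint32_t b)
{
    m->esp -= 4;
    m->mem[m->esp + 3] = a;
    m->mem[m->esp + 2] = c;
    m->mem[m->esp + 1] = d;
    m->mem[m->esp + 0] = b;
}

/* Returns 1 if both sequences leave the same stack contents and ESP. */
static int same_result(void)
{
    struct machine p, q;
    memset(&p, 0, sizeof p);
    memset(&q, 0, sizeof q);
    p.esp = q.esp = STACK_WORDS;

    demo_push(&p, 1, 2, 3, 4);
    demo_mov(&q, 1, 2, 3, 4);

    return p.esp == q.esp && memcmp(p.mem, q.mem, sizeof p.mem) == 0;
}
```

The model says nothing about speed; it only shows that the two forms are interchangeable in effect, so which one is faster has to be measured on the target CPU.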
Yes, actually. The stack, because it is commonly accessed, is virtually
always in the cache. Pushing and popping registers when the stack is
within the cache is a 1-cycle operation. Register-to-register moves are
also 1-cycle operation.
So, in some cases, when you start running out of registers, it's quicker to
push and pop than it is to flush part of the register file to another
location in memory.
Instruction timing analysis is very non-intuitive today. If you believe
that floating point operations are inherently much slower than integer
operations, for example, you would be wrong.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
"Tim Roberts" <ti...@MUNGED.microcosmotalk.com> wrote in message
news:4b0679ad$0$4970$9a6e...@unlimited.newshosting.com...
yep, and this is maybe also part of why MSVC's output on x64 is not as slow
as one would otherwise expect.
the goal then is reducing the total number of operations, and if possible,
using registers rather than memory (although, the difference is no longer
severe in the case of the cache).
hence, my prior complaints about SysV (removed more detailed explanation):
I expect, on average, that more time will be spent shuffling things around
than is saved by the (IMO overly) aggressive use of registers...
IMO, Win64 should be a little better, if anything, because it is far less
aggressive about its use of registers, provides spill space, and is in
general much simpler.
granted, Win64 is not perfect:
its prologue/epilogue concept is a little odd, and I would rather the frame
pointer be fixed place and always the same reg, but oh well (my compiler
always uses RBP as a frame pointer, and MSVC seems to like to use R12).
the other major cost is that it isn't really usable on Linux (prior
observation would be that it would be a bit of a mess trying to bridge it
with SysV...), and at the time would offer less payoff than simply running
code in an interpreter (this turning into the x86 interpreter thinggy...).
I also like PE32+ a little more than ELF64, granted neither is perfect IMO.
granted, maybe the design of ELF64 makes a little more sense on non-x86
systems, but PE32+ is likely better able to exploit the oddities of x64
(without reliance on information beyond what is normally given to the C
compiler). (optimization of code using ELF64 would likely require link time
knowledge, and subtle violations of the SysV ABI, but this should work in
practice).
that, or be able to compile code on Linux using some of the "funkiness"
Windows uses for DLL's (such as '__declspec').
alternatively, if code were compiled in batches, the compiler could infer
much of this without resorting to something like '__declspec'.
I forget, I think GCC may be able to compile in batches (where a big glob of
source files is given on the command line), but I am not certain if it uses
this for optimization purposes, although observationally I suspect MSVC
does...
> A <questi...@munged.microcosmotalk.com> wrote in part:
>> At any rate, I have found some numbers and these show
>> PUSH/POP without a cache miss should be faster than using
>> XMM registers but not MMX. See here:
>> http://www.agner.org/optimize/instruction_tables.pdf
> Always this "should be". Why? Just try it and measure!
> I have, and using near stack almost always incurs ZERO
> penalty vs reg-reg. Your Re-Order Buffer at work.
If your code is already doing more loads and stores than ALU
operations, then it should be faster to use the ALU to make copies
because your program won't be competing for busy units.
If ALU operations predominate, it can get faster to use memory moves
because they utilize otherwise idle units.
Memory moves carry a hazard besides cache misses, however: if another
load or store has to wait for a long sequence to complete before its
address is known, memory moves have to wait at least for the difficult
address to be computed before they can proceed. This can stall the
processor.
Maybe so, but the original Pentium concept of "pairable"
is entirely outmoded. Opcodes are decoded into multiple
micro-operations that get executed out of order.
> And the PPro/PII did register renaming for the stack pointer,
So? How would this help? The renamed register would
still have to wait for the update (subtract 4)
-- Robert
MSVC, like GCC, doesn't do anything special if you specify several
source modules on the command line. It just compiles them in
sequence. OTOH, if you ask for link time code generation, then it
will optimize across module boundaries, and it will do that whether you
compile several at a time from one invocation of CL or run CL several
times before running the linker.
Yeah, 32-bit Win98 on my AMD X2 5600+ 2.8Ghz is much, much faster than on my
K6-2 500Mhz. But, 16-bit DOS seems to run much, much slower. I'd swear
16-bit core on the 5600+ is implemented as a 25Mhz 486...
Rod Pemberton
I haven't intimately and passionately followed cpu design in a decade or
more. So, my very limited understanding of this is that that's only true
for certain instructions on certain cpu's. The newer x86 based cpu's have
three instruction decoders. Two for simple instructions and one for complex
instructions. The simple instructions are broken into uops and, depending on
the cpu, recombined into other uops via "micro-op fusion". The newer cpu's can
also merge entire x86 instructions together via "macro-fusion". But, from
what I've been able to find, the "complex" instructions, e.g., string
instructions etc., are microcoded sequences which block out-of-order
execution.
Rod Pemberton
On Nov 20, 6:05 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
>
> Yeah, 32-bit Win98 on my AMD X2 5600+ 2.8Ghz is much, much faster than on my
> K6-2 500Mhz. But, 16-bit DOS seems to run much, much slower. I'd swear
> 16-bit core on the 5600+ is implemented as a 25Mhz 486...
I'm not sure what programs or benchmarks gave you that idea. Surely it
doesn't actually run slower but maybe comparatively not as fast as
you'd expect?? I do know that a brief look at the AMD optimization
guide seemed very biased towards 32-bit code (with 16-bit being
somewhat less optimized) as well as not supporting certain speed
optimizations used for older machines (e.g. 486). While I don't
personally hate the x86 architecture, I will concede that it's very
very hard to optimize for several of its cpu variants in one single
program.
P.S. Not directly related, but as time goes by (maybe I'm spoiled by
my other multicore machine), WinXP on this P4 doesn't seem nearly as
fast. Granted, antivirus/antispyware slows it down, but I can't help
but wonder what cpus MS targeted (initially P3, I would guess, which is
way different than the P4) with their compiler. I think I read
somewhere that they optionally use SSE2 if supported in some places. I
don't know, to me it's both equally frustrating and curious. :-)
>Yeah, 32-bit Win98 on my AMD X2 5600+ 2.8Ghz is much, much faster than on my
>K6-2 500Mhz. But, 16-bit DOS seems to run much, much slower. I'd swear
>16-bit core on the 5600+ is implemented as a 25Mhz 486...
How are you running that 16-bit DOS? Win98 allows
you to run in real-mode with a bit of persuasion,
but otherwise a simple command prompt is still in
VM86 protected mode. Maybe the VM86
implementation is the thing that differs?
Best regards,
Bob Masta
"robert...@yahoo.com" <robert...@MUNGED.microcosmotalk.com> wrote in
message news:4b072eb8$0$5124$9a6e...@unlimited.newshosting.com...
well, mostly what I have observed is that the compiler (MSVC) seems to go a
good deal faster with a bunch of files at once (at least, a good deal faster
than GCC when compiling single files), so I suspect "something" is going on
(maybe optimizing reading in the headers?... I tend to use the "single super
header included per source file" strategy, as opposed to the "teh crapload
of little headers" strategy).
usually, this would be for compiling a bunch of files to spit out a DLL.
granted, in general, most of the compiler output I have seen from MSVC is
not, exactly, all that good (in terms of optimizations), but hell at least
it works...
but, at least there is no requirement that calls be done indirectly via the
GOT, as exists in the SysV ABI.
then again, I guess one would have to compare the exact differences between:
call foo
and:
call [gotrel foo] ; (or whatever the 'proper' syntax is here)
call [G.foo] ; this is what my assembler had used
             ; ('G.', 'S.', ... are special magic prefixes...).
so, maybe the cost is low enough to be ignorable...
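As a rough C analogy for the two call forms (not what any real linker or loader emits): a direct call names its target, while a GOT-style call first loads the target address from a table slot and then calls through it. The table `got_like` and the helper names are placeholders for illustration.

```c
static int foo(int x) { return x + 1; }

/* Stand-in for a GOT slot: a pointer the loader would fill in at run time. */
typedef int (*fn_ptr)(int);
static fn_ptr got_like[1] = { foo };

/* Roughly `call foo`: the target is known at link time. */
static int call_direct(int x)
{
    return foo(x);
}

/* Roughly `call [slot]`: one extra load to fetch the target address.
 * The volatile qualifier keeps the compiler from folding the load away. */
static int call_via_table(int x)
{
    fn_ptr volatile *slot = &got_like[0];
    return (*slot)(x);
}
```

Both functions return the same value. The only difference is the extra memory read, and if the slot sits in L1 that is the same kind of cheap access discussed earlier in the thread, which fits the guess that the cost may be low enough to ignore.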
"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
news:4b072ec3$0$5124$9a6e...@unlimited.newshosting.com...
dunno...
but, in my interpreter at least, 16-bit addressing involves additional logic
(such as masking off segment offsets so that they wrap, ...). (granted, I
never really finished or tested 16-bit support).
it is possible that maybe real CPUs do some vaguely similar stuff, so for
example, when loading segment registers, lots of internal operations are
done, and when doing addressing, additional logic is used, ...
this could be, for example, to save on needing specialized hardware for
cases "no one really uses anyways".
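A minimal sketch of the kind of masking described above, assuming real-mode rules: the 16-bit offset wraps within the 64 KiB segment before the segment base (segment times 16) is added. The function name and the `wrap_1mb` flag (modelling an 8086-style 20-bit address bus) are illustrative and not taken from the interpreter being discussed.

```c
#include <stdint.h>

/* Hypothetical helper: linear address for a real-mode seg:off access.
 * disp is added to the offset first, and the sum wraps at 16 bits. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off,
                                 uint16_t disp, int wrap_1mb)
{
    uint32_t ea = ((uint32_t)off + disp) & 0xFFFFu;  /* wrap inside segment */
    uint32_t linear = ((uint32_t)seg << 4) + ea;     /* segment base + offset */

    if (wrap_1mb)
        linear &= 0xFFFFFu;                          /* 20-bit address bus */
    return linear;
}
```

For example, segment 0x1000 with offset 0xFFFF and displacement 2 wraps to offset 0x0001 and gives linear address 0x10001 rather than 0x20001. The extra AND per access is the sort of added logic that a flat 32-bit address calculation does not need.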
"Bob Masta" <N0S...@daqarta.com> wrote in message
news:4b0808ce$0$4895$9a6e...@unlimited.newshosting.com...
>
> On 21 Nov 2009 00:05:23 GMT, "Rod Pemberton"
> <do_no...@nohavenot.cmm> wrote:
>
> >Yeah, 32-bit Win98 on my AMD X2 5600+ 2.8Ghz is much, much faster than on
> >my K6-2 500Mhz. But, 16-bit DOS seems to run much, much slower. I'd
> >swear 16-bit core on the 5600+ is implemented as a 25Mhz 486...
>
> How are you running that 16-bit DOS?
I run it both as RM MS-DOS and as a "console" window in Win98 SE.
> Maybe the VM86
> implementation is the thing that differs?
Yes, the "console" is faster on both machines. I think this is primarily
due to 32-bit disk access. Smartdrv in RM DOS improves performance, but it
still seems real slow on this cpu. It wouldn't surprise me if they
implemented a basic 16-bit core just for support purposes. But, that choice
seems odd for this cpu. It's possible it's not DOS, e.g., BIOS, since the
drive is SATA and using BIOS emulation of IDE. Could use of SMM be killing
16-bit performance? If I ever get the older machine up and running again, I
intend to find some test to check if my perception matches reality.
Rod Pemberton
Not in general, no.
However, since MMX registers are overlaid on the x87 FPU registers, the
pathway between integer regs and these could be quite slow.
You simply have to measure this on your designated target platform!
Terje
But that's not the same thing:
You are timing multiple back-to-back stack accesses vs the same number
of reg-reg moves, this generates sequential dependencies between all of
the ESP updates.
You did state that you only needed to save a single register, which is
harder to measure in isolation.
All it takes is one cache miss.
No.
Inside an inner (double) loop, you can never suffer more than a single
initial cache miss, and even that is extremely unlikely for [ESP].
Anyway, it really doesn't matter until you've actually measured both in
real production code.
.data?
_eax dd ?
_ecx dd ?
_edx dd ?
etc ....
Compare the speed of integer-to-MMX register moves against MOV REG
to MEM:
mov _eax, eax
mov _ecx, ecx
mov _edx, edx
etc ....
The latter performs well enough to be useful in code that does not
have to be re-entrant.
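The re-entrancy caveat can be shown in C, where a file-scope variable plays the role of a `.data?` save slot. This is an illustrative sketch with made-up function names: the recursive call overwrites the global slot, so the caller reads back the wrong value, whereas a local (stack) slot survives.

```c
/* Global save slot, analogous to `_eax dd ?` in the .data? section. */
static unsigned saved_n;

/* Saves n in the global slot across a nested call. Not re-entrant:
 * the nested call stores its own n into the same slot. */
static unsigned sum_to_global(unsigned n)
{
    unsigned rest;

    if (n == 0)
        return 0;
    saved_n = n;                   /* "mov _eax, eax" */
    rest = sum_to_global(n - 1);   /* clobbers saved_n */
    return saved_n + rest;         /* reads back the callee's value */
}

/* Same logic with a per-call stack slot: each activation has its own copy. */
static unsigned sum_to_local(unsigned n)
{
    unsigned saved, rest;

    if (n == 0)
        return 0;
    saved = n;
    rest = sum_to_local(n - 1);
    return saved + rest;
}
```

Here `sum_to_local(4)` gives 10 (4+3+2+1), while `sum_to_global(4)` gives 4, because every level reads back the 1 stored by the innermost call. In a leaf routine that is never re-entered the global slot works fine, which is the case described above.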
While memory in the uninitialised data section may be further away
than close cache, read it once and it won't be any longer.
Long ago in 16 bit Windows where stack space was very limited for
DLLs, passing data in globals worked fine.
Regards,
hutch