Looking at the code generated by Delphi 5 in the CPU window I see this:
//lbuffer^ := lbuffer^ + leftvol * l;
fld qword ptr [ebp-$28]
fmul qword ptr [ebp-$38]
fadd qword ptr [edi]
fstp qword ptr [edi]
wait
//inc( lbuffer );
add edi, $08
//rbuffer^ := rbuffer^ + rightvol * l;
fld qword ptr [ebp-$30]
fmul qword ptr [ebp-$38]
mov eax, [ebp-$20]
fadd qword ptr [eax]
mov eax, [ebp-$20]
fstp qword ptr [eax]
wait
//inc( rbuffer );
add dword ptr [ebp-$20], $08
I notice there are a number of WAIT instructions in there. Why? It seems like a waste of time. But obviously Borland had a good reason to put them there...
Also, I guess I can speed up the code above a little bit by using FPU registers and local variables?
Thanks for any enlightened insights.
I might be missing something, but why are you multiplying by 1? I
would think removing that would speed things up. FPU ASM is way out of
my league, but I thought the 32-bit x86 ISA used a stack-based FPU, so I don't
know if there are registers per se; you'd have to convert it to SSE.
--
DD
- Always assume I'm posting without coffee.
Thanks for having a look at it.
>I might be missing something, but why are you multiplying
>by 1?
I'm actually multiplying by "L", which is a single float variable in this procedure containing a sampled value between -1.0 and +1.0, but apparently the lowercase L looks very similar to a "1" in some fonts ;-)
>FPU ASM is way out of my league, but I thought the 32-bit x86 ISA
>used a stack-based FPU, so I don't know if there are
>registers per se,
The x86 FPU uses a stack indeed, but you can still access the 8 items at the top of the stack individually, like registers, as St(0) .. St(7)
>you'd have to convert it to SSE.
I wish I could - that would have major potential in this situation - however Delphi 5 doesn't support SSE (nor MMX, but there's a handy little byte-code converter available for MMX).
Thanks,
Bram
Bram,
If you could show us some of the surrounding code,
it would help.
Rgds, JohnH
>If you could show us some of the surrounding code,
>it would help.
Hi John,
I've simplified the code a bit, but here is the core of the
procedure. Basically it reads a sample from a sound lookup
table, applies volumes to it and stores it in 2 separate
(left/right) buffers.
All samples are "single" values, ranging from -1 to +1...
The two lines I singled out earlier are (according to the profiler) using by far the most CPU time in my entire project, so any performance gain would be useful.
Procedure getSamplesFromWavetable( leftbuf, rightbuf : pointer; numsamples : integer );
var
  lbuffer, rbuffer : ^single;
  n : integer;
  sample, leftvol, rightvol : single;
  idx : integer;
begin
  Lbuffer := leftbuf;
  Rbuffer := rightbuf;
  //determine the volume of the left audio channel
  leftvol := getLeftVolume();
  //determine the volume of the right channel
  rightvol := getRightVolume();
  for n := 0 to numSamples - 1 do
  begin
    // get an index into a lookup table containing
    // e.g. a sinewave or a sawtooth. This part has already
    // been nicely optimized using fixed point arithmetic
    idx := getNextIndexIntoWavetable();
    // get a sample from the wavetable
    sample := oscWavetable[idx];
    // now apply volume adjustments to this sample and
    // store in separate left/right audio buffers
    Lbuffer^ := Lbuffer^ + leftvol * sample;
    inc( Lbuffer );
    Rbuffer^ := Rbuffer^ + rightvol * sample;
    inc( Rbuffer );
  end;
end;
[...]
> I notice there is a number of WAIT instructions in there. Why? It seems like a waste of time.
"wait" is necessary to synchronize the FPU with the CPU. When the 80x87
was introduced both were on separate chips working separately and in
parallel. The "wait" in the code above makes sure edi is not used by the
CPU before the FPU has stored the result to this register. Also see
http://www.website.masmforum.com/tutorials/fptute/fpuchap3.htm#fwait
Not sure whether wait is still necessary today where CPU and FPU are
integrated on a single chip. Does anybody know?
According to http://faydoc.tripod.com/cpu/fwait.htm "wait" is also used
for the FPU exceptions. Anyway, "sync" probably would have been a better
name for it :-)
>Anyway, "sync" probably would have been a better name for it :-)
Ah.. that makes sense.. at least it makes the instruction sound a little bit more useful than "wait" :-)
I'll try and figure out whether it's still essential to have
critical code littered with waits. I consider the PIII my
bottom line CPU, and that one definitely had the FPU integrated into the CPU.
Thanks for the info!
On the 80286 it was not integrated yet, on the 80486 it was (at least
the DX). Not sure about the 80386 (I never had one, I switched from 286
to 486).
--
Jens Gruschel
http://www.pegtop.net
Doh! Actually 1 and l look almost the same on my system, the one is a
pixel shorter and a touch less blurry.
Are leftvol and rightvol constants in the loop?
DD
I should have looked at the other posts first, never mind.
I've been following http://17slon.com/blogs/gabr/blogger.html with much
interest lately.
386 was still external, I had a math co-processor.
> // now apply volume adjustments to this sample and
> // store in separate left/right audio buffers
> Lbuffer^ := Lbuffer^ + leftvol * sample;
> inc( Lbuffer );
> Rbuffer^ := Rbuffer^ + rightvol * sample;
> inc( Rbuffer );
Change to
Lbuffer^ := Lbuffer^ + leftvol * sample;
Rbuffer^ := Rbuffer^ + rightvol * sample;
inc( Lbuffer );
inc( Rbuffer );
Will eliminate the first wait.
But I don't think it will speed up very much.
As to Why the Wait...If you didn't wait,
edi would be changed to edi+8 before
the fstp[edi] got around to executing.
This might be a good candidate for threading.
One thread for each channel on a multi-cpu
chip should be twice as fast.
Not according to the Intel IA-32 docs. Its only use these days is to
synchronize floating-point exceptions. And if you look at where the Delphi
compiler places them, it is after each source line, so that if an exception
is generated, it is handled and flagged for the correct line.
If you code by hand and omit the waits, you will often get the exception
some instructions later than where it actually occurred.
These days its only use is to sync floating-point exceptions. If you look
at where they are placed by the compiler, it is at the end of each source
line, so the debugger can flag the correct source line should an exception
occur.
The Intel IA-32 docs state that it is for exception synchronization, nothing
else.
>
> There are a couple of lines in my code that get executed almost
> 400,000 times per second - so they are good candidates for
> optimization.
Hi, are you the Bram Bos of Hammerhead and Tuareg fame? Although I'm
more of an FL Studio man these days, I had fun with Hammerhead when it
first came out.
--
Cheers,
Carl
Hi Carl,
Yes.. that's me. Funny you remember my name.
I made Hammerhead over 10 years ago! Incidentally that was
my first real Delphi application ;-)
I'm currently working on a freeware all-in-one music
application for tiny netbooks, like the Asus EEE PC.
These typically have small screens and slow, last-gen CPUs so
that's why I need to optimize my synth and mix code as much as
possible (my own EEE PC has an underclocked Celeron M @ 630MHz
so that explains why I'm interested in assembler all of a
sudden).
Cheers,
Bram
Hammerhead was great, one of the first applications I had when I
switched from DOS to Windows. Thanks for that!
> I'm currently working on a freeware all-in-one music
> application for tiny netbooks, like the Asus EEE PC.
Sounds interesting. Let us know when it's done :-)
I agree
I have often tried removing FWAIT instructions in BASM code and I have never
been able to measure any difference in speed.
They are needed after FP instructions that can raise exceptions if one wants
the exception at the right line. If one does not care, then having one at the
end of a function is enough.
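A minimal BASM sketch of that last point, not taken from anyone's actual code in this thread: a mix loop with a single FWAIT at the end of the routine instead of one per source line. The names MixAdd and PSingleItem are hypothetical, and it assumes a 32-bit Delphi compiler with the default register calling convention. If an unmasked FPU exception occurs inside the loop, it may be reported later than the instruction that caused it.

```pascal
program SingleFwaitSketch;
{$APPTYPE CONSOLE}

type
  PSingleItem = ^Single;  // hypothetical local type, so nothing depends on RTL pointer types

// Hypothetical helper: Dst[i] := Dst[i] + Vol * Src[i] for i = 0..Count-1.
// Register calling convention: EAX = Dst, EDX = Src, ECX = Count; Vol is on the stack.
procedure MixAdd(Dst, Src: PSingleItem; Count: Integer; Vol: Single);
asm
        test    ecx, ecx
        jle     @done             // nothing to do for Count <= 0
        fld     Vol               // keep the volume in ST(0) for the whole loop
@next:
        fld     dword ptr [edx]   // ST(0) = source sample, ST(1) = Vol
        fmul    st(0), st(1)      // ST(0) = sample * Vol
        fadd    dword ptr [eax]   // add what is already in the buffer
        fstp    dword ptr [eax]   // store and pop; ST(0) is Vol again
        add     edx, 4
        add     eax, 4
        dec     ecx
        jnz     @next
        fstp    st(0)             // discard Vol so the FPU stack stays balanced
        fwait                     // one exception check for the whole routine
@done:
end;

var
  Dst, Src: array[0..3] of Single;
  i: Integer;
begin
  for i := 0 to 3 do
  begin
    Dst[i] := 1.0;
    Src[i] := 0.25 * i;
  end;
  MixAdd(@Dst[0], @Src[0], 4, 0.5);
  for i := 0 to 3 do
    Writeln(Dst[i]:0:3);
end.
```

Keeping the volume on the FPU stack also avoids reloading it from memory for every sample, which is the "use FPU registers" idea from the original question.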
Regards
Dennis
> I notice there is a number of WAIT instructions in there. Why? It seems
> like a waste of time. But obviously Borland had a good reason to put them
> there...
WAIT is not a sleep.
> Also, I guess I can speed up the code above a little bit by using FPU
> registers and local variables?
The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
processors. Do not worry about them.
> Thanks for any enlightened insights.
http://download.intel.com/design/pentium/MANUALS/24319101.PDF
page 485
Description
Causes the processor to check for and handle pending, unmasked, floating-point exceptions before proceeding. (FWAIT is an alternate mnemonic for the WAIT).
This instruction is useful for synchronizing exceptions in critical sections of code. Coding a WAIT instruction after a floating-point instruction insures that any unmasked floating-point exceptions the instruction may raise are handled before the processor can modify the instruction's results. See the section titled "Floating-Point Exception Synchronization" in Chapter 7 of the Intel Architecture Software Developer's Manual, Volume 1, for more
Regards
Dennis
>> I'm currently working on a freeware all-in-one music
>> application for tiny netbooks, like the Asus EEE PC.
>
>Sounds interesting. Let us know when it's done :-)
I will and I definitely need to credit the people in this group in my documentation, because I've already learned an awful lot just reading through the archives of this newsgroup.
Why care about anything older than P4 these days?
Regards
Dennis
For this particular project (as I explained further down the thread) I'm writing specifically for ultraportable netbooks like the Asus EEE PC. The basic model uses an old Celeron M (Dothan-900) running underclocked at 630MHz. So it's basically a slowed down Pentium III with some P4 sauce.
These are still far from being obsolete.
I have at least two production real-time data capture systems still
running Pentium III with Windows 2000, and no intention to replace them
any time soon.
David
btw: I have some old BASM code that does the same as you do here:
http://thaddy.co.uk/tdkdsplib.zip
Maybe you can use it.
Thanks.. it looks like I should give the AdjustVolume32f
procedure a good closer look. Very helpful!
| The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
processors. Do not worry about them.
How about the instruction decode cycle? How does the cpu know it's a WAIT
until the instruction is read?
--
Q
08/07/2008 09:41:26
XanaNews Version 1.18.1.11 [Leonel's & Pieter's & Q's Mods]
>The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
>processors. Do not worry about them.
Hmmm.. My 3GHz Quad runs (timed with GetTickCount(), so in milliseconds)
10 million iterations of a 16-nop loop in 31
10 million iterations of a 16-wait loop in 62
My 2GHz Celeron laptop takes 50/90
So it looks like it takes twice as long as the 0-cycle nop :->
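For anyone who wants to repeat this, here is a rough sketch of what such a test could look like. It is a guess at the setup, not the code that produced the numbers above: NopLoop and WaitLoop are hypothetical names, and it assumes a 32-bit Delphi compiler on Windows. GetTickCount typically advances in steps of roughly 10 to 16 ms, so readings in the 30 to 60 ms range are only a few timer ticks; raising Iterations gives steadier results.

```pascal
program WaitVsNop;
{$APPTYPE CONSOLE}
uses
  Windows;

const
  Iterations = 10000000;  // 10 million loop passes, as in the post above

// Both routines expect Count > 0; it arrives in EAX (register convention).
procedure NopLoop(Count: Integer);
asm
        mov     ecx, eax
@again:
        nop; nop; nop; nop; nop; nop; nop; nop
        nop; nop; nop; nop; nop; nop; nop; nop
        dec     ecx
        jnz     @again
end;

procedure WaitLoop(Count: Integer);
asm
        mov     ecx, eax
@again:
        wait; wait; wait; wait; wait; wait; wait; wait
        wait; wait; wait; wait; wait; wait; wait; wait
        dec     ecx
        jnz     @again
end;

var
  t0, tNop, tWait: Cardinal;
begin
  t0 := GetTickCount;
  NopLoop(Iterations);
  tNop := GetTickCount - t0;

  t0 := GetTickCount;
  WaitLoop(Iterations);
  tWait := GetTickCount - t0;

  Writeln('16-nop loop:  ', tNop, ' ms');
  Writeln('16-wait loop: ', tWait, ' ms');
end.
```

Note that this only measures back-to-back WAITs with no floating-point work in between, which is not how they appear in compiler output, so it says little about their cost inside a real mixing loop.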
"Bram Bos" <bur...@gmail.com> wrote in message
news:4899402f$1...@newsgroups.borland.com...
>
> There are a couple of lines in my code that get executed almost 400,000
> times per second - so they are good candidates for optimization.
> Looking at the code generated by Delphi 5 in the CPU window I see this:
..
> Also, I guess I can speed up the code above a little bit by using FPU
> registers and local variables?
If you have an example that I can compile and run, then I can try to
help.
We had success that way in the thread "I need the fastest routine"
news:486e...@newsgroups.borland.com
E.g. a small project with a procedure that fills the array (sin/cos) and a
procedure with the main function.
(I have Delphi 5 too)
Kind regards,
Herby
P.S.:
Sorry for my English... I hope you understand me :-)
--
http://www.hubert-seidel.de
"Bram Bos" <bur...@gmail.com> wrote in message
news:48999e7b$1...@newsgroups.borland.com...
...
> Procedure getSamplesFromWavetable( leftbuf, rightbuf : pointer; numsamples
> : integer );
> var
> lbuffer, rbuffer : ^single;
> n : integer;
> sample, leftvol, rightvol : single;
> idx : integer;
> begin
>
> Lbuffer := leftbuf;
> Rbuffer := rightbuf;
>
> //determine the volume of the left audio channel
> leftvol := getLeftVolume();
> //determine the volume of the right channel
> rightvol := getRightVolume();
>
> for n := 0 to numSamples - 1 do
> begin
>
> // get an index into a lookup table containing
> // e.g. a sinewave or a sawtooth. This part has already
> // been nicely optimized using fixed point arithmetic
###
> idx := getNextIndexIntoWavetable();
###
What does this function do?
It is possible that this is the bottleneck; I can't tell without seeing the
code.
>
> // get a sample from the wavetable
> sample := oscWavetable[idx];
###
I think moving the code from getNextIndexIntoWavetable to here
and combining it with the sample := assignment, with some rearranging,
could increase the speed.
>
> // now apply volume adjustments to this sample and
> // store in separate left/right audio buffers
> Lbuffer^ := Lbuffer^ + leftvol * sample;
> inc( Lbuffer );
> Rbuffer^ := Rbuffer^ + rightvol * sample;
> inc( Rbuffer );
With external function calls, you (or I) can't be sure what they do to the FPU.
I have some ideas for optimizing the code, if there are no surprises in the
external call.
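A rough sketch of the inlining idea, with loud assumptions: the real getNextIndexIntoWavetable has not been posted, so this guesses that it is a 32-bit fixed-point phase accumulator whose top bits select the table entry. TableBits, Phase, PhaseInc and MixFromWavetable are all hypothetical names, and the table contents in the demo are made up.

```pascal
program InlineIndexSketch;
{$APPTYPE CONSOLE}
{$Q-,R-}  // the phase accumulator is meant to wrap around at 2^32

const
  TableBits = 10;                  // hypothetical table size: 1024 entries
  TableSize = 1 shl TableBits;
  FracBits  = 32 - TableBits;      // low bits hold the fractional phase

type
  TSingleArray = array[0..(MaxInt div SizeOf(Single)) - 1] of Single;
  PSingleArray = ^TSingleArray;

var
  oscWavetable: array[0..TableSize - 1] of Single;

// The index calculation sits inside the loop, so there is no function call
// per sample and the compiler sees the whole loop body at once.
procedure MixFromWavetable(LeftBuf, RightBuf: PSingleArray; NumSamples: Integer;
  LeftVol, RightVol: Single; var Phase: Cardinal; PhaseInc: Cardinal);
var
  n: Integer;
  sample: Single;
begin
  for n := 0 to NumSamples - 1 do
  begin
    sample := oscWavetable[Phase shr FracBits];  // top TableBits bits = index
    Inc(Phase, PhaseInc);                        // wraps around silently
    LeftBuf^[n] := LeftBuf^[n] + LeftVol * sample;
    RightBuf^[n] := RightBuf^[n] + RightVol * sample;
  end;
end;

var
  L, R: array[0..3] of Single;
  Phase: Cardinal;
  i: Integer;
begin
  // Demo data only: a ramp instead of a real waveform.
  for i := 0 to TableSize - 1 do
    oscWavetable[i] := i / TableSize;
  for i := 0 to 3 do
  begin
    L[i] := 0.0;
    R[i] := 0.0;
  end;
  Phase := 0;
  // A PhaseInc of 1 shl FracBits steps exactly one table entry per sample.
  MixFromWavetable(@L[0], @R[0], 4, 1.0, 0.5, Phase, Cardinal(1) shl FracBits);
  for i := 0 to 3 do
    Writeln(L[i]:0:6, '  ', R[i]:0:6);
end.
```

Whether this helps depends entirely on what the real index function does; if it is already a couple of integer instructions, the gain is mostly the saved call.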
Kind regards,
Herby
--
http://www.hubert-seidel.de