WAIT... why?

Bram Bos

Aug 6, 2008, 2:09:51 AM

There are a couple of lines in my code that get executed almost 400,000 times per second - so they are good candidates for optimization.

Looking at the code generated by Delphi 5 in the CPU window I see this:

//lbuffer^ := lbuffer^ + leftvol * l;
fld qword ptr [ebp-$28]
fmul qword ptr [ebp-$38]
fadd qword ptr [edi]
fstp qword ptr [edi]
wait
//inc( lbuffer );
add edi, $08
//rbuffer^ := rbuffer^ + rightvol * l;
fld qword ptr [ebp-$30]
fmul qword ptr [ebp-$38]
mov eax, [ebp-$20]
fadd qword ptr [eax]
mov eax, [ebp-$20]
fstp qword ptr [eax]
wait
//inc( rbuffer );
add qword ptr [ebp-$20], $08

I notice there is a number of WAIT instructions in there. Why? It seems like a waste of time. But obviously Borland had a good reason to put them there...

Also, I guess I can speed up the code above a little bit by using FPU registers and local variables?

Thanks for any enlightened insights.

Dan Downs

Aug 6, 2008, 7:13:07 AM

I might be missing something, but why are you multiplying by 1? I
would think removing that would speed things up. FPU ASM is way out of
my league, but I thought the x86 32-bit ISA used a stack-based FPU, so I
don't know if there are registers per se; you'd have to convert it to SSE.

--
DD
- Always assume I'm posting without coffee.

Bram Bos

Aug 6, 2008, 7:53:03 AM

Dan Downs <ddo...@nospam.online-access.com> wrote:
>Bram Bos wrote:
>> There are a couple of lines in my code that get executed almost 400,000 times per second - so they are good candidates for optimization.
>>
>> Looking at the code generated by Delphi 5 in the CPU window I see this:
>>
>> //lbuffer^ := lbuffer^ + leftvol * l;

[snip]

>> //rbuffer^ := rbuffer^ + rightvol * l;

Thanks for having a look at it.

>I might be missing something but why are you multiplying
>by 1?

I'm actually multiplying by "L", which is a single float variable in this procedure containing a sampled value between -1.0 and +1.0, but apparently the lowercase L looks very similar to a "1" in some fonts ;-)

>FPU ASM is way out of my league, but I thought x86 32bit isa
>used a stack based FPU so I don't know if there are
>registers per se,

The x86 FPU does indeed use a stack, but you can still access the 8 items on the stack individually, like registers, as ST(0) .. ST(7), where ST(0) is the top of the stack.

>you'd have to convert it to SSE.

I wish I could - that would have major potential in this situation - however Delphi 5 doesn't support SSE (nor MMX, but there's a handy little byte-code converter available for MMX).

Thanks,
Bram

John Herbster

Aug 6, 2008, 8:22:02 AM

"Bram Bos" <bur...@gmail.com> wrote

> There are a couple of lines in my code that get executed

> almost 400,000 times per second ...
> Lbuffer^ := Lbuffer^ + leftvol * L;
> inc( Lbuffer );
> Rbuffer^ := Rbuffer^ + rightvol * L;
> inc( Rbuffer );

Bram,
If you could show us some of the surrounding code,
it would help.
Rgds, JohnH

Bram Bos

Aug 6, 2008, 8:52:11 AM

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote:

>If you could show us some of the surrounding code,
>it would help.

Hi John,

I've simplified the code a bit, but here is the core of the
procedure. Basically it reads a sample from a sound lookup
table, applies volumes to it and stores it in 2 separate
(left/right) buffers.

All samples are "single" values, ranging from -1 to +1...
The two lines I singled out earlier are (according to the profiler) using by far the most CPU time in my entire project, so any performance gain would be useful.

procedure getSamplesFromWavetable( leftbuf, rightbuf : pointer; numsamples : integer );
var
  lbuffer, rbuffer : ^single;
  n : integer;
  sample, leftvol, rightvol : single;
  idx : integer;
begin

  lbuffer := leftbuf;
  rbuffer := rightbuf;

  //determine the volume of the left audio channel
  leftvol := getLeftVolume();
  //determine the volume of the right channel
  rightvol := getRightVolume();

  for n := 0 to numsamples - 1 do
  begin

    // get an index into a lookup table containing
    // e.g. a sinewave or a sawtooth. This part has already
    // been nicely optimized using fixed point arithmetic
    idx := getNextIndexIntoWavetable();

    // get a sample from the wavetable
    sample := oscWavetable[idx];

    // now apply volume adjustments to this sample and
    // store in separate left/right audio buffers
    lbuffer^ := lbuffer^ + leftvol * sample;
    inc( lbuffer );
    rbuffer^ := rbuffer^ + rightvol * sample;
    inc( rbuffer );

  end;

end;

Jens Gruschel

Aug 6, 2008, 9:26:11 AM

> //lbuffer^ := lbuffer^ + leftvol * l;
> fld qword ptr [ebp-$28]
> fmul qword ptr [ebp-$38]
> fadd qword ptr [edi]
> fstp qword ptr [edi]
> wait
> //inc( lbuffer );
> add edi, $08

[...]

> I notice there is a number of WAIT instructions in there. Why? It seems like a waste of time.

"wait" is necessary to synchronize the FPU with the CPU. When the 80x87
was introduced both were on separate chips working separately and in
parallel. The "wait" in the code above makes sure edi is not used by the
CPU before the FPU has stored the result to this register. Also see
http://www.website.masmforum.com/tutorials/fptute/fpuchap3.htm#fwait

Not sure whether wait is still necessary today where CPU and FPU are
integrated on a single chip. Does anybody know?

Jens Gruschel

Aug 6, 2008, 9:40:40 AM

> Not sure whether wait is still necessary today where CPU and FPU are
> integrated on a single chip. Does anybody know?

According to http://faydoc.tripod.com/cpu/fwait.htm "wait" is also used
for the FPU exceptions. Anyway, "sync" probably would have been a better
name for it :-)

Bram Bos

Aug 6, 2008, 9:47:30 AM

Jens Gruschel <nos...@thisurldoesnotexist.com> wrote:

>Anyway, "sync" probably would have been a better name for it :-)

Ah.. that makes sense.. at least it makes the instruction sound a little bit more useful than "wait" :-)

I'll try and figure out whether it's still essential to have
critical code littered with waits. I consider the PIII my
bottom line CPU, and that one definitely had the FPU integrated into the CPU.

Thanks for the info!

Jens Gruschel

Aug 6, 2008, 10:11:53 AM

> I consider the PIII my
> bottom line CPU, and that one definitely had the FPU integrated into the CPU.

On the 80286 it was not integrated yet, on the 80486 it was (at least
the DX). Not sure about the 80386 (I never had one, I switched from 286
to 486).

--
Jens Gruschel
http://www.pegtop.net

Dan Downs

Aug 6, 2008, 10:14:31 AM

Bram Bos wrote:
> Dan Downs <ddo...@nospam.online-access.com> wrote:
>> Bram Bos wrote:
>>> There are a couple of lines in my code that get executed almost 400,000 times per second - so they are good candidates for optimization.
>>>
>>> Looking at the code generated by Delphi 5 in the CPU window I see this:
>>>
>>> //lbuffer^ := lbuffer^ + leftvol * l;
> [snip]
>>> //rbuffer^ := rbuffer^ + rightvol * l;
>
> Thanks for having a look at it.
>
>> I might be missing something but why are you multiplying
>> by 1?
>
> I'm actually multiplying by "L", which is a single float variable in this procedure containing a sampled value between -1.0 and +1.0, but apparently the lowercase L looks very similar to a "1" in some fonts ;-)

Doh! Actually 1 and l look almost the same on my system, the one is a
pixel shorter and a touch less blurry.

Are leftvol and rightvol constants in the loop?

DD

Dan Downs

Aug 6, 2008, 10:15:35 AM

> Are leftvol and rightvol constants in the loop?

I should have looked at the other posts first, never mind.

Dan Downs

Aug 6, 2008, 10:27:21 AM

Depending on how getNextIndexIntoWavetable() works, it looks like you
could split the loop into two threads: one each for the left and right
channels, waiting for both before returning. That, combined with any FPU
improvements the others come up with, could add up to quite a bit. It
would depend on numsamples being large enough to warrant using threads,
though. If it's more of a real-time problem, then some sort of threaded
pipeline/queuing structure might be better.
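
To make the two-thread idea concrete, here is a rough sketch using TThread. It assumes the samples for the block have already been looked up into a temporary array, because a stateful getNextIndexIntoWavetable() could not safely be called from two threads at once. TMixThread, BlockSize and the volume values are invented for the example and are not from the original code.

```pascal
program ThreadedMix;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  {$IFDEF UNIX}cthreads,{$ENDIF} // only needed by Free Pascal on Unix
  Classes;

const
  BlockSize = 8; // hypothetical; real blocks would have to be much larger

type
  TBlock = array[0..BlockSize - 1] of Single;
  PBlock = ^TBlock;

  // Hypothetical worker: mixes one channel of an already sampled block.
  TMixThread = class(TThread)
  private
    FSrc, FDest: PBlock;
    FVol: Single;
  protected
    procedure Execute; override;
  public
    constructor Create(ASrc, ADest: PBlock; AVol: Single);
  end;

constructor TMixThread.Create(ASrc, ADest: PBlock; AVol: Single);
begin
  // Set the fields first: the thread may start running inside inherited Create.
  FSrc := ASrc;
  FDest := ADest;
  FVol := AVol;
  inherited Create(False);
end;

procedure TMixThread.Execute;
var
  n: Integer;
begin
  for n := 0 to BlockSize - 1 do
    FDest^[n] := FDest^[n] + FVol * FSrc^[n];
end;

var
  samples, lbuf, rbuf: TBlock;
  lthread, rthread: TMixThread;
  i: Integer;
begin
  for i := 0 to BlockSize - 1 do
  begin
    samples[i] := i / BlockSize; // stand-in for the wavetable lookups
    lbuf[i] := 0.0;
    rbuf[i] := 0.0;
  end;
  lthread := TMixThread.Create(@samples, @lbuf, 0.5); // left channel
  rthread := TMixThread.Create(@samples, @rbuf, 1.0); // right channel
  lthread.WaitFor; // wait for both channels before going on
  rthread.WaitFor;
  lthread.Free;
  rthread.Free;
  WriteLn(lbuf[4]:0:3, ' ', rbuf[4]:0:3);
end.
```

Creating two threads per call is far too expensive for small blocks, so a real version would probably keep two worker threads alive and hand them work. The sketch only shows the split and the wait-for-both step.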

I've been following http://17slon.com/blogs/gabr/blogger.html with much
interest lately.

Dan Downs

Aug 6, 2008, 10:17:39 AM

Jens Gruschel wrote:
>> I consider the PIII my
>> bottom line CPU, and that one definitely had the FPU integrated into
>> the CPU.
>
> On the 80286 it was not integrated yet, on the 80486 it was (at least
> the DX). Not sure about the 80386 (I never had one, I switched from 286
> to 486).
>

386 was still external, I had a math co-processor.

Bob Gonder

Aug 6, 2008, 10:58:51 AM

Bram Bos wrote:

> // now apply volume adjustments to this sample and
> // store in separate left/right audio buffers
> Lbuffer^ := Lbuffer^ + leftvol * sample;
> inc( Lbuffer );
> Rbuffer^ := Rbuffer^ + rightvol * sample;
> inc( Rbuffer );

Change to

Lbuffer^ := Lbuffer^ + leftvol * sample;

Rbuffer^ := Rbuffer^ + rightvol * sample;

inc( Lbuffer );
inc( Rbuffer );

Will eliminate the first wait.
But I don't think it will speed up very much.
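
For anyone who wants to try the reordering in isolation, here is a minimal, self-contained sketch. MixReordered, the buffer size and the volume values are made up for illustration, and a plain source array stands in for the wavetable lookup. Whether the compiler then emits one wait fewer depends on the Delphi version, so it is worth checking the CPU window.

```pascal
program ReorderedMix;
{$APPTYPE CONSOLE}

const
  NumSamples = 4; // hypothetical tiny buffer, just for the demo

type
  PSingle = ^Single;
  TBuf = array[0..NumSamples - 1] of Single;

// Adds the scaled source samples to the left/right buffers. Both stores
// come first and all pointer increments afterwards.
procedure MixReordered(src, leftbuf, rightbuf: PSingle; count: Integer;
  leftvol, rightvol: Single);
var
  n: Integer;
  sample: Single;
begin
  for n := 0 to count - 1 do
  begin
    sample := src^;
    leftbuf^ := leftbuf^ + leftvol * sample;
    rightbuf^ := rightbuf^ + rightvol * sample;
    Inc(src);
    Inc(leftbuf);
    Inc(rightbuf);
  end;
end;

var
  srcbuf, lbuf, rbuf: TBuf;
  i: Integer;
begin
  for i := 0 to NumSamples - 1 do
  begin
    srcbuf[i] := 0.25 * i; // stand-in for the wavetable samples
    lbuf[i] := 0.0;
    rbuf[i] := 0.0;
  end;
  MixReordered(@srcbuf[0], @lbuf[0], @rbuf[0], NumSamples, 0.5, 1.0);
  for i := 0 to NumSamples - 1 do
    WriteLn(lbuf[i]:0:3, ' ', rbuf[i]:0:3);
end.
```

The arithmetic is the same as in the original loop; only the order of the statements changes, so the buffers end up with identical contents.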

As to Why the Wait...If you didn't wait,
edi would be changed to edi+8 before
the fstp[edi] got around to executing.

This might be a good candidate for threading.
One thread for each channel on a multi-cpu
chip should be twice as fast.

Dennis Passmore

Aug 6, 2008, 1:24:45 PM

check your email

Derek Jones

Aug 6, 2008, 2:01:41 PM

"Jens Gruschel" <nos...@thisurldoesnotexist.com> wrote in message
news:4899a627>

>
> Not sure whether wait is still necessary today where CPU and FPU are
> integrated on a single chip. Does anybody know?

Not according to the Intel IA-32 docs. Its only use these days is to
synchronize floating point exceptions. And if you look at where the Delphi
compiler places them, it is after each source line, so that if an exception
is generated, it is handled and flagged for the correct line.

If you code by hand and omit the waits, you will often get the exception
some instructions later than where it actually occurred.

Derek Jones

Aug 6, 2008, 2:05:22 PM

"Bob Gonder" <no...@nowhere.invalid> wrote in message
news:9aej94dup0sk6ro0i...@4ax.com...

>
> As to Why the Wait...If you didn't wait,
> edi would be changed to edi+8 before
> the fstp[edi] got around to executing.

These days its only use is to sync floating point exceptions. If you look
at where they are placed by the compiler, it is at the end of each source
line, so the debugger can flag the correct source line should an exception
occur.

The Intel IA-32 docs state that it is for exception synchronization, nothing
else.

Carl Caulkett

Aug 6, 2008, 1:55:33 PM

Bram Bos wrote:

>
> There are a couple of lines in my code that get executed almost
> 400,000 times per second - so they are good candidates for
> optimization.

Hi, are you the Bram Bos of Hammerhead and Tuareg fame? Although I'm
more of an FL Studio man these days, I had fun with Hammerhead when it
first came out.

--
Cheers,
Carl

Bram Bos

Aug 6, 2008, 2:44:02 PM

Hi Carl,

Yes.. that's me. Funny you remember my name.
I made Hammerhead over 10 years ago! Incidentally that was
my first real Delphi application ;-)

I'm currently working on a freeware all-in-one music
application for tiny netbooks, like the Asus EEE PC.
These typically have small screens and slow, last-gen CPUs so
that's why I need to optimize my synth and mix code as much as
possible (my own EEE PC has an underclocked Celeron M @ 630MHz
so that explains why I'm interested in assembler all of a
sudden).

Cheers,
Bram

Jens Gruschel

Aug 6, 2008, 5:02:48 PM

> Yes.. that's me. Funny you remember my name.
> I made Hammerhead over 10 years ago! Incidentally that was
> my first real Delphi application ;-)

Hammerhead was great, one of the first applications I had when I
switched from DOS to Windows. Thanks for that!

> I'm currently working on a freeware all-in-one music
> application for tiny netbooks, like the Asus EEE PC.

Sounds interesting. Let us know when it's done :-)

Dennis Christensen

Aug 7, 2008, 2:46:37 AM

Hi Derek

I agree

I have often tried removing FWAIT instructions in BASM code and I have never
been able to measure any difference in speed.

They are needed after FP instructions that can raise exceptions if one wants
the exception at the right line. If one does not care, then having one at the
end of a function is enough.

Regards
Dennis

Dennis Christensen

Aug 7, 2008, 2:41:58 AM

Hi

> I notice there is a number of WAIT instructions in there. Why? It seems
> like a waste of time. But obviously Borland had a good reason to put them
> there...

WAIT is not a sleep.

> Also, I guess I can speed up the code above a little bit by using FPU
> registers and local variables?

The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
processors. Do not worry about them.

> Thanks for any enlightened insights.

http://download.intel.com/design/pentium/MANUALS/24319101.PDF
page 485
Description

Causes the processor to check for and handle pending, unmasked,
floating-point exceptions before proceeding. (FWAIT is an alternate
mnemonic for the WAIT).

This instruction is useful for synchronizing exceptions in critical sections
of code. Coding a WAIT instruction after a floating-point instruction
insures that any unmasked floating-point exceptions the instruction may
raise are handled before the processor can modify the instruction's
results. See the section titled "Floating-Point Exception Synchronization"
in Chapter 7 of the Intel Architecture Software Developer's Manual,
Volume 1, for more

Regards
Dennis

Bram Bos

Aug 7, 2008, 2:58:05 AM

Jens Gruschel <nos...@thisurldoesnotexist.com> wrote:

>> I'm currently working on a freeware all-in-one music
>> application for tiny netbooks, like the Asus EEE PC.
>
>Sounds interesting. Let us know when it's done :-)

I will and I definitely need to credit the people in this group in my documentation, because I've already learned an awful lot just reading through the archives of this newsgroup.

Dennis Christensen

Aug 7, 2008, 2:47:44 AM

Hi

Why care about anything older than P4 these days?

Regards
Dennis

Bram Bos

Aug 7, 2008, 3:02:58 AM

"Dennis Christensen" <d...@GateHouse.dk> wrote:
>Hi
>
>Why care about anything older than P4 these days?

For this particular project (as I explained further down the thread) I'm writing specifically for ultraportable netbooks like the Asus EEE PC. The basic model uses an old Celeron M (Dothan-900) running underclocked at 630MHz. So it's basically a slowed down Pentium III with some P4 sauce.

These are still far from being obsolete.

David J Taylor

Aug 7, 2008, 2:57:08 AM

I have at least two production real-time data capture systems still
running Pentium III with Windows 2000, and no intention to replace them
any time soon.

David

thaddy

Aug 7, 2008, 4:12:18 AM

The compiler has no way of knowing that you are doing audio and that it can safely ignore the FPU state in most cases.
You can set the FPU state to ignore floating point exceptions. In that case the wait will take no cycles.

btw: I have some old BASM code that does the same as you do here:
http://thaddy.co.uk/tdkdsplib.zip

Maybe you can use it.
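
Masking the FPU exceptions, as suggested above, could look like the sketch below. It uses SetExceptionMask from the Math unit, which exists in Delphi 6 and later and in Free Pascal. Delphi 5 most likely does not have it, and there the usual route would be Set8087CW with all six exception mask bits set (for example Set8087CW($133F)), so treat that part as an assumption. The Ratio helper and the division by zero are only there to show that no exception is raised.

```pascal
program MaskFpuExceptions;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Math;

// Hypothetical helper, so that the division really happens at run time.
function Ratio(a, b: Double): Double;
begin
  Result := a / b;
end;

var
  r: Double;
begin
  // Mask every floating point exception: faulty operations now produce
  // Inf or NaN instead of raising an exception.
  SetExceptionMask([exInvalidOp, exDenormalized, exZeroDivide,
    exOverflow, exUnderflow, exPrecision]);

  r := Ratio(1.0, 0.0); // would raise EZeroDivide with the default mask
  if IsInfinite(r) then
    WriteLn('1/0 gave Inf and no exception was raised');
end.
```

With everything masked, bad input shows up as Inf or NaN in the output buffers instead of as an exception, so an audio application would probably want to do this once at startup and sanitize its inputs.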

Bram Bos

unread,

Aug 7, 2008, 4:20:44 AM8/7/08

to

Thanks.. it looks like I should give the AdjustVolume32f
procedure a good closer look. Very helpful!

Q Correll

Aug 7, 2008, 12:42:41 PM

Dennis,

| The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
processors. Do not worry about them.

How about the instruction decode cycle? How does the cpu know it's a WAIT
until the instruction is read?

--
Q

08/07/2008 09:41:26

XanaNews Version 1.18.1.11 [Leonel's & Pieter's & Q's Mods]

Bob Gonder

Aug 7, 2008, 2:02:36 PM

Dennis Christensen wrote:

>The WAIT instruction = FWAIT takes 0 clock cycles to execute on modern
>processors. Do not worry about them.

Hmmm.. My 3GHz Quad runs (timed with GetTickCount(), so in milliseconds)
10 million iterations of a 16-nop loop in 31 ms
10 million iterations of a 16-wait loop in 62 ms

My 2GHz Celeron laptop takes 50 ms / 90 ms.

So it looks like it takes twice as long as the 0-cycle nop :->

Hubert Seidel

Aug 9, 2008, 2:14:28 PM

Hello Bram,

"Bram Bos" <bur...@gmail.com> wrote in message
news:4899402f$1...@newsgroups.borland.com...

>
> There are a couple of lines in my code that get executed almost 400,000
> times per second - so they are good candidates for optimization.

> Looking at the code generated by Delphi 5 in the CPU window I see this:

..

> Also, I guess I can speed up the code above a little bit by using FPU
> registers and local variables?

If you have an example for us that I can compile and run, then I can try to
help.
We had success that way in the thread "I need the fastest routine"
news:486e...@newsgroups.borland.com
For example, a small project with a procedure that fills the array (sin/cos)
and a procedure with the main function would do.
(I have Delphi 5 too)

Regards,
Herby

P.S.:
Sorry for my English... I hope you understand me :-)

--
http://www.hubert-seidel.de

Hubert Seidel

Aug 9, 2008, 3:53:33 PM

Hello Bram,

"Bram Bos" <bur...@gmail.com> wrote in message
news:48999e7b$1...@newsgroups.borland.com...
...
> Procedure getSamplesFromWavetable( leftbuf, rightbuf : pointer; numsamples
> : integer );
> var
> lbuffer, rbuffer : ^single;
> n : integer;
> sample, leftvol, rightvol : single;
> idx : integer;
> begin
>
> Lbuffer := leftbuf;
> Rbuffer := rightbuf;
>
> //determine the volume of the left audio channel
> leftvol := getLeftVolume();
> //determine the volume of the right channel
> rightvol := getRightVolume();
>
> for n := 0 to numSamples - 1 do
> begin
>
> // get an index into a lookup table containing
> // e.g. a sinewave or a sawtooth. This part has already
> // been nicely optimized using fixed point arithmetic
###
> idx := getNextIndexIntoWavetable();
###
What does this function do?
It is possible that this is the bottleneck; I can't tell without seeing the
code.
>
> // get a sample from the wavetable
> sample := oscWavetable[idx];
###
I think moving the code from getNextIndexIntoWavetable to here
and combining it with the sample := assignment, with some rearrangement,
could increase the speed.

>
> // now apply volume adjustments to this sample and
> // store in separate left/right audio buffers
> Lbuffer^ := Lbuffer^ + leftvol * sample;
> inc( Lbuffer );
> Rbuffer^ := Rbuffer^ + rightvol * sample;
> inc( Rbuffer );

With external function calls, you (or I) can't be sure what they do to the FPU.

I have some ideas for optimizing the code, as long as there are no surprises
in the external call.
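
As an illustration of the kind of inlining meant here: if getNextIndexIntoWavetable() is a fixed-point phase accumulator (an assumption, since its code was not posted), the call can be replaced by two integer operations inside the loop. TableBits, the 16.16 phase format, phase and phaseInc are all invented for this sketch.

```pascal
program InlineIndex;
{$APPTYPE CONSOLE}

const
  TableBits = 8;                      // hypothetical 256-entry wavetable
  TableSize = 1 shl TableBits;
  FracBits = 16;                      // hypothetical 16.16 fixed-point phase
  PhaseMask = (TableSize shl FracBits) - 1;
  NumSamples = 16;

var
  oscWavetable: array[0..TableSize - 1] of Single;
  lbuf, rbuf: array[0..NumSamples - 1] of Single;
  phase, phaseInc: Cardinal;
  n: Integer;
  sample, leftvol, rightvol: Single;
begin
  // Fill the table with one cycle of a sine wave.
  for n := 0 to TableSize - 1 do
    oscWavetable[n] := Sin(2 * Pi * n / TableSize);
  for n := 0 to NumSamples - 1 do
  begin
    lbuf[n] := 0.0;
    rbuf[n] := 0.0;
  end;

  leftvol := 0.5;
  rightvol := 1.0;
  phase := 0;
  phaseInc := 16 shl FracBits; // advance 16 table entries per output sample

  for n := 0 to NumSamples - 1 do
  begin
    // Inlined index calculation: the integer part of the phase is the index.
    sample := oscWavetable[phase shr FracBits];
    phase := (phase + phaseInc) and PhaseMask; // wrap around the table

    lbuf[n] := lbuf[n] + leftvol * sample;
    rbuf[n] := rbuf[n] + rightvol * sample;
  end;

  WriteLn(lbuf[4]:0:3, ' ', rbuf[4]:0:3);
end.
```

With no function call inside the loop, the compiler also knows that nothing else touches the FPU between the lookup and the two multiply-adds.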

Regards,
Herby

--
http://www.hubert-seidel.de