I need the fastest routine

Clément Doss

unread,

Jul 4, 2008, 5:57:27 PM7/4/08

to

Hi,

I know this is the right place. I'm a just embarrassed to post such a silly question
and code. If you could shed some light, I would appreciate very much!!!

In this project, I need to find the max and min integer from an array.
I came up with this code. It seems faster than the others I tried on.

---

procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin : integer );
var
MaxArray,
MinArray : array[boolean] of integer;
i : integer;
begin
aMax :=0;
aMin :=0;
i:=high( aArray );
if i>=0 then begin
MaxArray[True] := aArray[0];
MinArray[True] := aArray[0];
while i>0 do begin
MaxArray[ aArray[i] > MaxArray[True] ] := aArray[i];
MinArray[ aArray[i] < MinArray[True] ] := aArray[i];
dec(i);
end;
aMax := MaxArray[True];
aMin := MinArray[True];
end;
end;

---

Unfortunately it's not fast enough. I need more speed. I just wanted to be sure this
routine is ok, or if there's some tricks to make it faster. If you guys think is a
waste of time trying to improve on this routine, then I'll try to optimize other
parts of the project.

I appreciate very much your help.

Clément

Q Correll

unread,

Jul 4, 2008, 7:33:56 PM7/4/08

to

Clément,

Uhhh,... what's wrong with using the Min/Max value functions:

aMax := MaxIntValue(const Data: array of Integer): Integer;

aMin := MinIntValue(const Data: array of Integer): Integer;

--
Q

07/04/2008 16:24:37

XanaNews Version 1.17.5.7 [Q's Salutation mod]

Clément Doss

unread,

Jul 4, 2008, 8:49:19 PM7/4/08

to

Q Correll wrote:
> Clément,
>
> Uhhh,... what's wrong with using the Min/Max value functions:
>
> aMax := MaxIntValue(const Data: array of Integer): Integer;
>
> aMin := MinIntValue(const Data: array of Integer): Integer;
>
>

Hi Q,

They are not fast enough :(
And I have to scan the same array twice one to get the max and another to get the min.

In fact, scanning the array and returning both value seemed always faster than
dealing with each one separated. At least in my home made benchmark. :)

Q Correll

unread,

Jul 5, 2008, 12:51:44 AM7/5/08

to

Clément,

| They are not fast enough :(
| And I have to scan the same array twice one to get the max and another
to get the min.

Try this:

procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
integer );

var i : integer;
begin
aMax := minInt;
aMin := maxInt;
for i := 0 to high(aArray) then
begin
aMax := max(aMax, aArray[i];
aMin := min(aMin, aArray[i];
end;
end;

--
Q

07/04/2008 21:47:29

Stig Johansen

unread,

Jul 5, 2008, 1:20:09 AM7/5/08

to

Q Correll wrote:

> Clément,
>
> | They are not fast enough :(
> | And I have to scan the same array twice one to get the max and another
> to get the min.
>
> Try this:
>
> procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
> integer );
> var i : integer;
> begin
> aMax := minInt;
> aMin := maxInt;
> for i := 0 to high(aArray) then
> begin
> aMax := max(aMax, aArray[i];
> aMin := min(aMin, aArray[i];
> end;
> end;

Since this apperently is performance critical, he could also just use a
pointer, something like this:

procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
integer );
var i : integer;

var p : pinteger ;

begin
aMax := minInt;
aMin := maxInt;

p := @aArray[0] ;
for i := 0 to high(aArray) do
begin
if p^ > aMax then aMax := p^ ;
if p^ < aMin then aMin := p^;
inc(p);
end;
end;

Then he can avoid the overhead of function calls with the min and max
functions.

--
Best regards
Stig Johansen

Nenad Trkulja

unread,

Jul 5, 2008, 11:30:14 AM7/5/08

to

Hi Clément,

try this:

procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
integer );
var

i, m, mmod, n: Integer;
begin
aMax := 0;
aMin := 0;
if Length(aArray) > 0 then
begin
aMin := aArray[0];
aMax := aArray[0];
m := Length(aArray) div 4;
mmod := Length(aArray) mod 4;
n := 0;
for i := 0 to m -1 do
begin
if (aArray[n] < aMin) or (aArray[n+1] < aMin) or
(aArray[n+2] < aMin) or (aArray[n+3] < aMin) then
begin
if aArray[n] < aMin then
aMin := aArray[n];
if aArray[n+1] < aMin then
aMin := aArray[n+1];
if aArray[n+2] < aMin then
aMin := aArray[n+2];
if aArray[n+3] < aMin then
aMin := aArray[n+3];
end;

if (aArray[n] > aMax) or (aArray[n+1] > aMax) or
(aArray[n+2] > aMax) or (aArray[n+3] > aMax) then
begin
if aArray[n] > aMax then
aMax := aArray[n];
if aArray[n+1] > aMax then
aMax := aArray[n+1];
if aArray[n+2] > aMax then
aMax := aArray[n+2];
if aArray[n+3] > aMax then
aMax := aArray[n+3];
end;

Inc(n, 4);
end;

for i := 0 to mmod-1 do
begin
if aArray[i+n] < aMin then
aMin := aArray[i+n];
if aArray[i+n] > aMax then
aMax := aArray[i+n];
end;
end;
end;

John Herbster

unread,

Jul 5, 2008, 12:53:19 PM7/5/08

to

"Clément Doss" <cd...@dhs.com.br> wrote
> I need the fastest routine.

I doubt it! Usually when programmers are looking for
the fastest execution of a small part of their program,
their are looking in the wrong place to speed up the
overall execution.

If you were to explain about the larger problem, we
would be better able to help you.

> I need to find the max and min integer from an array.

(1) How big are the arrays?
(2) How do the arrays get filled?
(3) How often do you need to fetch the extremes?
(4) How often do the arrays get changed?
(5) Why don't you determine the min and max
when the arrays are filled?
(6) Why don't you keep the arrays sorted?
(7) Why don't you link the elements in order?
(8) And a dozen more questions.
But mainly, we could help you better if you
explained more about the larger problem.

Rgds, JohnH

Q Correll

unread,

Jul 5, 2008, 1:25:59 PM7/5/08

to

Stig,

| Since this apperently is performance critical, he could also just use a
| pointer, something like this:

| ...

| Then he can avoid the overhead of function calls with the min and max
| functions.

Good idea!

--
Q

07/05/2008 10:23:27

Q Correll

unread,

Jul 5, 2008, 1:37:09 PM7/5/08

to

John,

| (5) Why don't you determine the min and max
| when the arrays are filled?

I had the same wonder. <g> So I just assumed that perhaps he had no
control over how/when the array was generated.

However, I think my and Stig's suggestions combine for a pretty good
optimization solution of Clément's posted code.

--
Q

07/05/2008 10:32:25

Clément Doss

unread,

Jul 5, 2008, 1:53:01 PM7/5/08

to

Hi Nenad

>
> try this:
>
> procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
> integer );

Thanks a lot! I will try it!

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 1:48:39 PM7/5/08

to

Stig Johansen wrote:

> Since this apperently is performance critical, he could also just use
> a pointer, something like this:
> procedure MinMaxArray( const aArray : Array of Integer; out aMax,
> aMin : integer );
> var i : integer;
> var p : pinteger ;
> begin
> aMax := minInt;
> aMin := maxInt;
> p := @aArray[0] ;
> for i := 0 to high(aArray) do
> begin
> if p^ > aMax then aMax := p^ ;
> if p^ < aMin then aMin := p^;
> inc(p);
> end;
> end;
>
> Then he can avoid the overhead of function calls with the min and max
> functions.

I didn't check, but I'm pretty sure that code that does not use
pointers will be equally fast (but IMO, more readable, and also more
portable), since the optimizer will turn it into the code that you
posted anyway:

procedure MinMaxArray(const aArray: array of Integer;
out aMax, aMin: Integer);
var
I: Integer;
begin
aMax := MinInt;
aMin := MaxInt;
for I := 0 to High(aArray) do
begin
if aArray[I] > aMax then
aMax := aArray[I];
if aArray[I] < aMin then
aMin := aArray[I];
end;
end;

I'm not sure if this is faster (I could imagine the optimizer already
does this), but it could be:

procedure MinMaxArray(const aArray: array of Integer;
out aMax, aMin: Integer);
var
I: Integer;
Value: Integer;
begin
aMax := MinInt;
aMin := MaxInt;
for I := 0 to High(aArray) do
begin
Value := aArray[I];
if Value > aMax then
aMax := Value;
if Value < aMin then
aMin := Value;
end;
end;

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Mit der Dummheit kämpfen Götter selbst vergebens"
"Against stupidity the (very) gods themselves contend in vain"
-- Friedrich von Schiller

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 2:14:24 PM7/5/08

to

Rudy Velthuis [TeamB] wrote:

> > begin
> > if p^ > aMax then aMax := p^ ;
> > if p^ < aMin then aMin := p^;
> > inc(p);
> > end;

> begin

> if aArray[I] > aMax then
> aMax := aArray[I];
> if aArray[I] < aMin then
> aMin := aArray[I];
> end;
>

> begin
> Value := aArray[I];
> if Value > aMax then
> aMax := Value;
> if Value < aMin then
> aMin := Value;
> end;
> end;

I just coded all three versions in one program. Your code and my first
code produce exactly the same machine code. Not one single instruction
is different.

My second variety is a little simpler, but is not measurably faster or
slower.

--
Rudy Velthuis [TeamB] http://www.teamb.com

Fundamentalists: believe 2+2=5 because It Is Written. Somewhere.
They have a lot of trouble on their tax returns.

"Moderate" believers: live their lives on the basis that 2+2=4.
But go regularly to church to be told that 2+2 once made 5, or
will one day make 5, or in a very real and spiritual sense,
should make 5.

"Moderate" atheists: know that 2+2=4 but think it impolite to say
so too loudly as people who think 2+2=5 might be offended.

"Militant" atheists: "Oh for pity's sake. HERE. Two pebbles. Two
more pebbles. FOUR pebbles. What is WRONG with you people?"

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 2:22:13 PM7/5/08

to

Nenad Trkulja wrote:

<snip>

> begin
> if (aArray[n] < aMin) or (aArray[n+1] < aMin) or
> (aArray[n+2] < aMin) or (aArray[n+3] < aMin) then
> begin
> if aArray[n] < aMin then
> aMin := aArray[n];
> if aArray[n+1] < aMin then
> aMin := aArray[n+1];
> if aArray[n+2] < aMin then
> aMin := aArray[n+2];
> if aArray[n+3] < aMin then
> aMin := aArray[n+3];
> end;

<snip>

That looks real cool, but is actually slightly slower than the code
Stig or I posted. I don't quite understand why you think that dividing
the array in 4 will make the code faster. The loop will be shorter, but
checking the loop variable is negligible compared to the comparisons
done in the loop.

I'll post my test to .attachments.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Basically, I no longer work for anything but the sensation I
have while working."
-- Albert Giacometti (sculptor)

Bob Gonder

unread,

Jul 5, 2008, 2:32:48 PM7/5/08

to

Stig Johansen wrote:

>Since this apperently is performance critical, he could also just use a
>pointer, something like this:

I have no idea how optimal Delphi is, but this is what I'd do.

> procedure MinMaxArray( const aArray : Array of Integer; out aMax, aMin :
> integer );
> var i : integer;
> var p : pinteger ;
> begin
> aMax := minInt;
> aMin := maxInt;
> p := @aArray[0] ;

ECX = high(aArray)
PUSH ESI
MOV ESI,aArray
MOV EAX,[ESI+ECX*4]
{Might be off-by-one (ECX too high), not up on Delphi Arrays.}
MOV aMin,EAX
MOV aMax,EAX
The first element is both the initial min and max,
so just set it and remove from loop.
DEC ECX

> for i := 0 to high(aArray) do
> begin

AGAIN:
MOV EAX,[ESI+ECX*4]

> if p^ > aMax then aMax := p^ ;

CMP aMax,EAX
CMOVC aMax,EAX

> if p^ < aMin then aMin := p^;

CMP EAX,aMin
CMOVC aMin,EAX
> inc(p);
DEC ECX
> end;
JNZ AGAIN
>end;
POP ESI

The loop is so short it might benefit from unrolling.

Q Correll

unread,

Jul 5, 2008, 2:35:53 PM7/5/08

to

Rudy,

| I'm not sure if this is faster (I could imagine the optimizer already
| does this), but it could be:

Your thought being that Value would be in the local stack?

--
Q

07/05/2008 11:35:17

Clément Doss

unread,

Jul 5, 2008, 2:51:45 PM7/5/08

to

Hi JohnH,

>
> I doubt it! Usually when programmers are looking for
> the fastest execution of a small part of their program,
> their are looking in the wrong place to speed up the
> overall execution.

This routine is called a lot of times. In fact it's a part of a bigger project. I
don't know exactly how to explain it in English, but I'll try anyway.
My team must maintain an interface that will diagnose a hardware, in fact, we are
taking over a project that had some problems with the older team.
Anyway, while connected with the hardware, we read some data. Most of them are arrays
of double. More precisely "up" to a triplets of doubles.
The specification isn't clear about how large the arrays are. I guess they depend on
the hardware model. I've logged over +5000 elements in a single array. The values are
not "very big" nor very precise (up to 5 decimals). During the diagnostic, the
routines check for inconsistencies. Depending on the array position, we can call a
more specific diagnostic that will give us another array to check and so on.

We are not allowed to order the original array because we need to keep the index
exactly at the original position. I still have a lot of things to digest, and I just
haven't all the required time.

So far we managed to improve a lot over the older routines. The doubles can be
converted to integer (not even int64). There are a lot of functions doing operations
on those arrays. Most of them were declared like /somefunc(V:array of double)/ . Just
by redeclaring them to /somefunc( const V: array of double )/ helped a lot.
One of the most time consuming routines is the minmax functions. The time use to
convert the array to integer, and extract the minmax is a fraction of the time the
older routine used. I used the routine I posted here. Even the manager noticed the
improvement (WOW!!).
There's another procedure building a matrix, and doing some operations. Still haven't
got the time to check what operations exactly. But the declaration is also all wrong.
And I think I can improve the algorithm too. I must study better before doing those
changes.

That's why I posted in this newsgroup. I just want to be sure that the routine is as
fast as it can get.

Time to drink some coffee and read some more specs!

Clément

Q Correll

unread,

Jul 5, 2008, 2:38:25 PM7/5/08

to

Bob,

Reminds me of "the ol'days" in CLM/SDForum on CIS when we had the
"Assembler Challenges." <g>

--
Q

07/05/2008 11:37:21

Hubert Seidel

unread,

Jul 5, 2008, 3:57:55 PM7/5/08

to

Hi Rudy,

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schrieb im Newsbeitrag
news:xn0fsbtfwfpwjp...@rvelthuis.de...
> Rudy Velthuis [TeamB] wrote:
...

> I just coded all three versions in one program. Your code and my first
> code produce exactly the same machine code. Not one single instruction
> is different.

(sorry, my english is not so good, i try it :)
i moved the access from parameters to local variables and .
A Test with three arrays (first has 50000 random-integers +/- 25000, next
has linear up 0,1,2,3,4,etc till 49999. and the last has linear down
0,-1,-2,-3 etc. till -49999) used per round averaged 10 cpu-clocks on
PIII-Cpu:

(*
procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);
var // averaged 11 clocks per roundtrip
i:integer;
begin
aMax := aArray[Low(aArray)];
aMin := aMax;
for i := Low(aArray) + 1 to High(aArray) do
begin
if (aArray[i]<aMin) then aMin:=aArray[i];
if (aArray[i]>aMax) then aMax:=aArray[i];
end;
end;
*)

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);
var // averaged 10 clocks per roundtrip = 10% faster than above
i,mi,ma:integer;
begin
ma := aArray[Low(aArray)];
mi := ma;
for i := Low(aArray) + 1 to High(aArray) do
begin
if (aArray[i]<mi) then mi:=aArray[i];
if (aArray[i]>ma) then ma:=aArray[i];
end;
aMax := ma;
aMin := mi;
end;

mfg.
Herby

--
http://www.hubert-seidel.de

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 5:08:46 PM7/5/08

to

Rudy Velthuis [TeamB] wrote:

> Q Correll wrote:
>
> > Rudy,
> >
> > > I'm not sure if this is faster (I could imagine the optimizer
> > > already does this), but it could be:
> >
> > Your thought being that Value would be in the local stack?
>

> Yes.
>
> Note that aMin and aMax are out variables, i.e. accessed indirectly.
> Hubert's idea of using local variables throughout (i.e. two local
> variables for min and max) is probably even faster. I just checked it
> (added a fifth variety to my test program), and indeed, it is 33%
> faster than using aMax and aMin.

Sorry, 50% faster (takes only 2/3 of the time).

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Why don't they make the whole plane out of that black box stuff."
-- Steven Wright.

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 5:03:16 PM7/5/08

to

Q Correll wrote:

> Rudy,
>
> > I'm not sure if this is faster (I could imagine the optimizer
> > already does this), but it could be:
>
> Your thought being that Value would be in the local stack?

Yes.

Note that aMin and aMax are out variables, i.e. accessed indirectly.
Hubert's idea of using local variables throughout (i.e. two local
variables for min and max) is probably even faster. I just checked it
(added a fifth variety to my test program), and indeed, it is 33%
faster than using aMax and aMin.

--

Rudy Velthuis [TeamB] http://www.teamb.com

"If Tyranny and Oppression come to this land, it will be in
the guise of fighting a foreign enemy."
-- James Madison

Pierre le Riche

unread,

Jul 5, 2008, 5:03:01 PM7/5/08

to

Hi Clément,

> Unfortunately it's not fast enough. I need more speed. I just wanted to
> be sure this routine is ok, or if there's some tricks to make it faster.

I haven't benchmarked it, but I reckon the routine below should be faster:

procedure MinMaxArray(const aArray: array of Integer; out aMax, aMin:
Integer);
var
LInd, LTempMin, LTempMax, LArrVal: Integer;
begin
LTempMin := MaxInt;
LTempMax := MinInt;
for LInd := 0 to High(aArray) do
begin
{Get the array value}
LArrVal := aArray[LInd];
{Update the minimum}
LTempMin := LArrVal + ((-Ord(LTempMin < LArrVal))
and (LTempMin - LArrVal));
{Update the maximum}
LTempMax := LArrVal + ((-Ord(LTempMax > LArrVal))
and (LTempMax - LArrVal));
end;
aMin := LTempMin;
aMax := LTempMax;
end;

Regards,
Pierre

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 5:04:28 PM7/5/08

to

Hubert Seidel wrote:

> procedure MinMaxArray(const aArray:array of integer; out aMax,
> aMin:integer);
> var // averaged 10 clocks per roundtrip = 10% faster than above
> i,mi,ma:integer;
> begin
> ma := aArray[Low(aArray)];
> mi := ma;
> for i := Low(aArray) + 1 to High(aArray) do
> begin
> if (aArray[i]<mi) then mi:=aArray[i];
> if (aArray[i]>ma) then ma:=aArray[i];
> end;
> aMax := ma;
> aMin := mi;
> end;

It is 50% faster than Stig's or my variety! Very well!

--
Rudy Velthuis [TeamB] http://www.teamb.com

"A coward is a hero with a wife, kids, and a mortgage."
-- Marvin Kitman.

John Herbster

unread,

Jul 5, 2008, 5:18:07 PM7/5/08

to

"Clément Doss" <cd...@dhs.com.br> wrote
> ... This routine is called a lot of times. ... while connected with the hardware, we read some data.

Clément,

Why not pick out min/max as they are being read?

> Most of them are arrays of double. More precisely "up" to a triplets of doubles. The specification isn't clear about how large the arrays are. ... I've logged over +5000 elements in a single array. The values are not "very big" nor very precise (up to 5 decimals).

Only 5 decimals? Think about using single!

> ... We are not allowed to order the original array ...

> The doubles can be converted to integer (not even int64).

Doubles *can* be losslessly converted to Int64 and still retain their sort order!

function DoubleToSortableInt64(const dbl: double): Int64;
type tBA = array [0..SizeOf(dbl)-1] of byte;
var i64: Int64 absolute dbl; i: integer;
begin
Result := i64;
If (tBA(Result)[high(tBA)] and $80) = 0
then tBA(Result)[high(tBA)] := tBA(Result)[high(tBA)] xor $80
else for i := 0 to high(tBA) do tBA(Result)[i] := tBA(Result)[i];
end;

> ...

Hope that this helps, JohnH

John Herbster

unread,

Jul 5, 2008, 5:53:39 PM7/5/08

to

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote

> Result := i64;
> If (tBA(Result)[high(tBA)] and $80) = 0
> then tBA(Result)[high(tBA)] := tBA(Result)[high(tBA)] xor $80
> else for i := 0 to high(tBA) do tBA(Result)[i] := tBA(Result)[i];

Note correction below. --JohnH

function DoubleToSortableInt64(const dbl: double): Int64;
type tBA = array [0..SizeOf(dbl)-1] of byte;
var i64: Int64 absolute dbl; i: integer;
begin

If (tBA(i64)[high(tBA)] and $80) = 0
then tBA(Result) := tBA(i64)
else begin
tBA(Result)[high(tBA)] := tBA(i64)[high(tBA)] xor $7F;
for i := high(tBA)-1 downto 0 do tBA(Result)[i] := not tBA(i64)[i];
end;
end;

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 5:42:34 PM7/5/08

to

Pierre le Riche wrote:

> procedure MinMaxArray(const aArray: array of Integer;
> out aMax, aMin: Integer);
> var
> LInd, LTempMin, LTempMax, LArrVal: Integer;
> begin
> LTempMin := MaxInt;
> LTempMax := MinInt;
> for LInd := 0 to High(aArray) do
> begin
> {Get the array value}
> LArrVal := aArray[LInd];
> {Update the minimum}
> LTempMin := LArrVal + ((-Ord(LTempMin < LArrVal))
> and (LTempMin - LArrVal));
> {Update the maximum}
> LTempMax := LArrVal + ((-Ord(LTempMax > LArrVal))
> and (LTempMax - LArrVal));
> end;
> aMin := LTempMin;
> aMax := LTempMax;
> end;

It is more than twice as slow as Stig's or my version (on my Core 2
Quad 2.4 GHz). I am not quite sure why. Rearranging the order of the
routines the calling code doesn't help.

MinMaxTest.dpr.150: LTempMin := LArrVal + ((-Ord(LTempMin < LArrVal))
and (LTempMin - LArrVal));
0040B1E1 3BC6 cmp eax,esi
0040B1E3 0F9FC3 setnle bl
0040B1E6 83E37F and ebx,$7f
0040B1E9 F7DB neg ebx
0040B1EB 2BF0 sub esi,eax
0040B1ED 23DE and ebx,esi
0040B1EF 03D8 add ebx,eax
0040B1F1 8BF3 mov esi,ebx
MinMaxTest.dpr.152: LTempMax := LArrVal + ((-Ord(LTempMax > LArrVal))
and (LTempMax - LArrVal));
0040B1F3 3BC1 cmp eax,ecx
0040B1F5 0F9CC3 setl bl
0040B1F8 83E37F and ebx,$7f
0040B1FB F7DB neg ebx
0040B1FD 2BC8 sub ecx,eax
0040B1FF 23D9 and ebx,ecx
0040B201 03C3 add eax,ebx
0040B203 8BC8 mov ecx,eax

--
Rudy Velthuis [TeamB] http://www.teamb.com

"And God said, 'Let there be light' and there was light, but the
Electricity Board said He would have to wait until Thursday to be
connected." -- Spike Milligan.

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 6:00:49 PM7/5/08

to

Q Correll wrote:

> John,
>
> > (5) Why don't you determine the min and max
> > when the arrays are filled?
>
> I had the same wonder. <g> So I just assumed that perhaps he had no
> control over how/when the array was generated.
>
> However, I think my and Stig's suggestions combine for a pretty good
> optimization solution of Clément's posted code.

Hubert's code is still the fastest by 50%.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Always go to other people's funerals, otherwise they won't come
to yours." -- Yogi Berra.

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 6:07:19 PM7/5/08

to

Clément Doss wrote:

> Hi JohnH,
>
> >
> > I doubt it! Usually when programmers are looking for
> > the fastest execution of a small part of their program,
> > their are looking in the wrong place to speed up the
> > overall execution.
>
> This routine is called a lot of times.

How often? I mean, you'll have to fill the array first, and I can't
imagine you do that lots of times. And if you do, you might think of a
better approach to that.

Or, if you call it on the same array all the time, then keep the
variables and do not call the function except once.

Or, if you change certain values in the array, then check if the value
being overwritten is one of min or max and if so, determine min or max
again (but not both, except in the very special case that min = max).
Also check the new values and if necessary, update min and max
accordingly.

Or, if you can, keep the array sorted.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"It is the job of thinking people not to be on the side of the
executioners." -- Albert Camus

Mark Gohara

unread,

Jul 5, 2008, 6:09:57 PM7/5/08

to

Clément Doss wrote:
> Hi,
>

> Unfortunately it's not fast enough. I need more speed. I just wanted to
> be sure this routine is ok, or if there's some tricks to make it faster.

> If you guys think is a waste of time trying to improve on this routine,
> then I'll try to optimize other parts of the project.
>
> I appreciate very much your help.
>
> Clément
>
>
>
This may sound dumb and I have not code this but why not do a quick sort
and pick the first and last elements of the array?

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 6:14:03 PM7/5/08

to

Mark Gohara wrote:

> This may sound dumb and I have not code this but why not do a quick
> sort and pick the first and last elements of the array?

Because a quicksort is still slower than doing two comparisons per
element, I'd guess.

But I would have two routines: one for the min, and one for the max,
and only call those when the corresponding value changes. It is useless
to also check for the max if the min value changes.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Opportunities multiply as they are seized." -- Sun Tzu

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 6:21:05 PM7/5/08

to

Rudy Velthuis [TeamB] wrote:

Hubert's:

MinMaxTest.dpr.130: if LArrVal > LTempMax then
MinMaxTest.dpr.131: LTempMax := LArrVal;
0040B1B3 3B55F8 cmp edx,[ebp-$08]
0040B1B6 7E03 jle $0040b1bb
0040B1B8 8955F8 mov [ebp-$08],edx
MinMaxTest.dpr.132: if LArrVal < LTempMin then
MinMaxTest.dpr.133: LTempMin := LArrVal;
0040B1BB 3BCA cmp ecx,edx
0040B1BD 7E02 jle $0040b1c1
0040B1BF 8BCA mov ecx,edx

--
Rudy Velthuis [TeamB] http://www.teamb.com

"#3 pencils and quadrille pads."
-- Seymoure Cray (1925-1996) when asked what CAD tools he used
to design the Cray I supercomputer; he also recommended using
the back side of the pages so that the lines were not so
dominant.

Clément Doss

unread,

Jul 5, 2008, 7:03:09 PM7/5/08

to

Hi all,

I really can't thank you enough for helping me with this one.

I have to come up with a short term solution. And replacing the minmax should come
first. On a second step I will improve the overall design. For example:
While converting the doubles to integer, I can get the max and min. But I still must
be sure that everything will work as expected. Once I understand better how this
project works, I am sure I can get rid of that calls. In that part of the program,
this routine is called 67%, followed by 58% of a matrix transformation routine.
I just can't imagine the performance boost if I replace the minmax routine with the
one you provided.

I took the code you all posted and update Rudy's benchmark test.
Those are the results...

Array0 36484396 - 19 9999995
Array1 35067716 - 19 9999995
Array2 32072704 - 19 9999995
Array3 18110452 - 19 9999995
Array4 49961416 - 19 9999995
Array5 105275380 - 19 9999995 --> Original code
Array6 63633496 - 19 9999995

Does this means that if I replace my code with the one in Array4, there will be a 2x
performance increase? Is that it?

Array3 is Hubert Seidel algorithm with a small change. I am working with local
variables iMin and iMax instead of aMin and aMax. In my machine it worked fast. I
guess I will have to bring this benchmark and run on the production machine to see
which one is better suited for the processor :D

I can't thank you enough for helping. I just hope you all enjoyed this Saturday
challenge! :D And if you keep lowering the cycles, I will most certainly test.

Clément.

Here is the code:

procedure MinMaxArray3(const aArray : Array of Integer; out aMax, aMin : integer );
var
i, m, mmod, n, iMax, iMin: Integer;

begin
aMax := 0;
aMin := 0;
if Length(aArray) > 0 then
begin

iMin := aArray[0];
iMax := aArray[0];

m := Length(aArray) div 4;
mmod := Length(aArray) mod 4;
n := 0;
for i := 0 to m -1 do
begin

if (aArray[n] < iMin) or (aArray[n+1] < iMin) or
(aArray[n+2] < iMin) or (aArray[n+3] < iMin) then
begin
if aArray[n] < iMin then
iMin := aArray[n];
if aArray[n+1] < iMin then
iMin := aArray[n+1];
if aArray[n+2] < iMin then
iMin := aArray[n+2];
if aArray[n+3] < iMin then
iMin := aArray[n+3];
end;

if (aArray[n] > iMax) or (aArray[n+1] > iMax) or
(aArray[n+2] > iMax) or (aArray[n+3] > iMax) then
begin
if aArray[n] > iMax then
iMax := aArray[n];
if aArray[n+1] > iMax then
iMax := aArray[n+1];
if aArray[n+2] > iMax then
iMax := aArray[n+2];
if aArray[n+3] > iMax then
iMax := aArray[n+3];
end;

Inc(n, 4);
end;

for i := 0 to mmod-1 do
begin

if aArray[i+n] < iMin then
iMin := aArray[i+n];
if aArray[i+n] > iMax then
iMax := aArray[i+n];
end;
aMax := iMax;
aMin := iMin;
end;
end;

Dan Downs

unread,

Jul 5, 2008, 7:22:11 PM7/5/08

to

> That looks real cool, but is actually slightly slower than the code
> Stig or I posted. I don't quite understand why you think that dividing
> the array in 4 will make the code faster. The loop will be shorter, but
> checking the loop variable is negligible compared to the comparisons
> done in the loop.

I've found loop unrolling to be faster in the past (until the K8 and
P4), but current processors minimize the jump in the loop so much that
it doesn't really do the trick anymore. I unrolled Clement's original
code by 8 and got ~10% improvement on an array of 50000 integers. But
Stig's code was much faster, unrolling it gave my nothing extra and some
times slower.

testing on a core 2 duo 2.16ghz, d2007

DD

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 7:34:22 PM7/5/08

to

Dan Downs wrote:

> > That looks real cool, but is actually slightly slower than the code
> > Stig or I posted. I don't quite understand why you think that
> > dividing the array in 4 will make the code faster. The loop will be
> > shorter, but checking the loop variable is negligible compared to
> > the comparisons done in the loop.
>
> I've found loop unrolling to be faster in the past (until the K8 and
> P4), but current processors minimize the jump in the loop so much
> that it doesn't really do the trick anymore.

Trkulja's loop unrolling code, if modified to be using local max and
min values (like Seidel does), is the fastest, but if it doesn't use
local min and max, it is a little slower than normal loops. Clèment
posted the code in his latest post.

So loop unrolling can be faster, but apparently only under certain
conditions.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"When you have to kill a man, it costs nothing to be polite."
-- Sir Winston Churchill (1874-1965)

Q Correll

unread,

Jul 5, 2008, 7:35:18 PM7/5/08

to

Rudy,

| Hubert's idea of using local variables throughout (i.e. two local
| variables for min and max) is probably even faster. I just checked it
| (added a fifth variety to my test program), and indeed, it is 33%
| faster than using aMax and aMin.

That's something I'm not sure I would have thought of or about.

Good idea!

Thanks!

--
Q

07/05/2008 16:33:45

Q Correll

unread,

Jul 5, 2008, 7:35:18 PM7/5/08

to

Hubert,

Excellent!

--
Q

07/05/2008 16:35:12

Q Correll

unread,

Jul 5, 2008, 7:36:21 PM7/5/08

to

Rudy,

| Hubert's code is still the fastest by 50%.

Yep, I saw that. Thanks!

--
Q

07/05/2008 16:36:08

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 7:44:09 PM7/5/08

to

Q Correll wrote:

> Rudy,
>
> > Hubert's code is still the fastest by 50%.
>
> Yep, I saw that. Thanks!

Not anymore. A combination of Trkulja and Seidel is the fastest now.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Vote early and vote often." -- Al Capone (1899-1947)

Q Correll

unread,

Jul 5, 2008, 7:44:37 PM7/5/08

to

Clément,

| ...followed by 58% of a matrix transformation routine.

I don't know your application. So you can ignore me if it appears that I
don't know what I'm talking about with regards to your app. <g>

In some types of matrix transformation routines, such as optimizations
like the Simplex algorithm, the index of the pivot element is important to
know in addition to the value.

--
Q

07/05/2008 16:40:20

Nenad Trkulja

unread,

Jul 5, 2008, 8:03:04 PM7/5/08

to

Rudy Velthuis [TeamB] wrote:

> That looks real cool, but is actually slightly slower than the code
> Stig or I posted. I don't quite understand why you think that dividing
> the array in 4 will make the code faster. The loop will be shorter, but
> checking the loop variable is negligible compared to the comparisons
> done in the loop.

> but is actually slightly slower than the code Stig or I posted.

Slower? Take a look in attachments: MinMaxTest2.dpr

some results: (last row is MinMaxArray3)

P4 3.4
0,211312 s
0,206404 s
0,207032 s
0,176657 s

Athlon 64 X2 4400+
0,320580 s
0,318645 s
0,320272 s
0,202763 s

Xeon 1.8 - 2x2
0,395145 s
0,382383 s
0,387549 s
0,345436 s

Pentium M 1.6
0,564841 s
0,479907 s
0,465793 s
0,416627 s

P3 1.0 (dual box)
2,098975 s
2,116293 s
2,096372 s
1,998513 s

> I don't quite understand why you think that dividing the array in 4 will
> make the code faster

http://en.wikipedia.org/wiki/Loop_unrolling

>The loop will be shorter, but
> checking the loop variable is negligible compared to the comparisons
> done in the loop.

Assignment costs much more, or not?

>I'll post my test to .attachments

Calling a test function only 4 times does not give correct results.

Hubert Seidel

unread,

Jul 5, 2008, 8:22:52 PM7/5/08

to

Hello,

"Q Correll" <qcor...@pacNObell.net> schrieb im Newsbeitrag
news:48700536$3...@newsgroups.borland.com...

Than you :-)

I am encouraged to write two versions in BASM. On my PIII both are only a
litle bit faster.
But not mensurable differences between both versions on my hardware:

version 1:

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);

asm
push ebx
push esi
push edi

mov ebx, ecx
mov ecx, [eax]
mov edi, ecx

test edx, edx
jle @@Output // If only one item //

@@LpBegin:
mov esi, [eax]

@@CheckMin:
cmp esi, edi
jnl @@CheckMax
mov edi, esi // Store in EDI if < //

@@CheckMax:
cmp ecx, esi
jnl @@LpEnd
mov ecx, esi // Store in ECX if > //

@@LpEnd:
add eax, 4
dec edx
jnz @@LpBegin

@@Output:
mov eax, [ebp+$08]
mov [ebx], ecx // Output aMin //
mov [eax], edi // Output aMax //

pop edi
pop esi
pop ebx
end;

=============================
version 2:

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);

asm
push ebx
push esi
push edi
push ebp

mov ebx, ecx
mov ecx, [eax]
mov edi, ecx

test edx, edx
jle @@Output // If only one item //

@@LpBeginA:
mov esi, [eax]
mov ebp, [eax+4]

@@CheckMinA:
cmp esi, edi
jnl @@CheckMaxA
mov edi, esi // Store in EDI if < //

@@CheckMaxA:
cmp ecx, esi
jnl @@LpEndA
mov ecx, esi // Store in ECX if > //

@@LpEndA:
dec edx
jz @@Output

@@CheckMinB:
cmp ebp, edi
jnl @@CheckMaxB
mov edi, ebp // Store in EDI if < //

@@CheckMaxB:
cmp ecx, ebp
jnl @@LpEndB
mov ecx, ebp // Store in ECX if > //

@@LpEndB:
add eax, 8
dec edx
jnz @@LpBeginA

@@Output:
pop ebp

mov eax, [ebp+$08]
mov [ebx], ecx // Output aMin //
mov [eax], edi // Output aMax //

pop edi
pop esi
pop ebx
end;

i am interested at test results with another hardware.

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 8:31:34 PM7/5/08

to

Nenad Trkulja wrote:

Core 4 Quad 2.4:

0,186393 s
0,183025 s
0,306794 s
0,194925 s

And I have absolutely no idea why MinMaxArray2 is so darned slow on
this machine (it hardly differs from MinMaxArray1 - only the use of one
extra local variable). I know how loop unrolling works, but that
doesn't mean it always makes sense. Your constant fetching from the
array, instead of comapring against one value and the fact that you
can't store intermediates means not all data can be done in registers
might slow things down a bit. Apparently it does, on my machine.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"The wit makes fun of other persons; the satirist makes fun of
the world; the humorist makes fun of himself."
-- James Thurber (1894-1961),
in Edward R. Murrow television interview

Rudy Velthuis [TeamB]

unread,

Jul 5, 2008, 9:03:28 PM7/5/08

to

Nenad Trkulja wrote:

> > The loop will be shorter, but
> > checking the loop variable is negligible compared to the comparisons
> > done in the loop.
>
> Assignment costs much more, or not?

Not if it is to another register

> > I'll post my test to .attachments
>
> Calling a test function only 4 times does not give correct results.

OK. And I read that on modern processors like mine, RDTSC is unreliable
as well.

I modified my setup to make it a bit more like yours and added a few
routines. Take a look at MinMaxTest4.dpr in attachments.

My output:

Original code
0,798931 s

Using pointers
0,212360 s

Using indices
0,215502 s

Reading array value once
0,206537 s

Nenad Trkulja - loop unrolling
0,263419 s

Hubert Seidel - local min and max
0,162074 s

Pierre le Riche - boolean trick
0,502107 s

Trkulja/Seidel - loop unrolling and local min and max
0,136875 s

As you can see, on my system, loop unrolling alone is even slower, but
combining it with a local min and max values is faster. I repeated the
test many times, and keep on getting similar results.

But I wonder why in your program, MinMaxArray2 is much slower than the
rest, while in my program, it isn't. AFAICS, it is the same code.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"A terrorist is someone who has a bomb, but doesn't have an
air force." -- William Blum

Nenad Trkulja

unread,

Jul 5, 2008, 9:31:40 PM7/5/08

to

Rudy Velthuis [TeamB] wrote:

>And I have absolutely no idea why MinMaxArray2 is so darned slow on
>this machine

Nider do I.

The only difference I can see is that the first uses jnl inside a loop and
2nd jle..have no idea what it dooes :)

>Your constant fetching from the
> array, instead of comapring against one value and the fact that you
> can't store intermediates means not all data can be done in registers
> might slow things down a bit. Apparently it does, on my machine.

I was trying to reduce the number of assignments.
But that depends on the values in the array and how they are spread.

Stig Johansen

unread,

Jul 6, 2008, 3:15:59 AM7/6/08

to

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fsceh7g1c0l...@rvelthuis.de...

> Dan Downs wrote:
>
> > I've found loop unrolling to be faster in the past (until the K8 and
> > P4), but current processors minimize the jump in the loop so much
> > that it doesn't really do the trick anymore.

I tried the maxtest2.dpr, ans found some interesting things (for me).
(Using D7 on an AMD Athlon 850 MHz.

> Trkulja's loop unrolling code, if modified to be using local max and
> min values (like Seidel does), is the fastest, but if it doesn't use
> local min and max, it is a little slower than normal loops. Clèment
> posted the code in his latest post.

I tried changing my own code to use local in and max. There were virtually
no difference (2 on the fourth decimal).

> So loop unrolling can be faster, but apparently only under certain
> conditions.

Then i tried 'unrolling' with the following construct (A little Q&D, but
just to see the diff):
TYPE
tIa = PACKED ARRAY [0..3] OF INTEGER ;
tpIa = ^tIa ;
var
I: Integer;
P: PInteger;
pIa : tPIa ;
iMin,iMax: INTEGER ;
begin
iMax := MinInt;
iMin := MaxInt;
pIa :=@aArray[0];
for I := 0 to High(aArray) DIV 4 do
begin
if ( PIa^[0] > iMax ) OR ( PIa^[1] > iMax ) OR ( PIa^[2] > iMax ) OR (
PIa^[3] > iMax ) then BEGIN
IF PIa^[0] > iMax THEN
iMax := PIa^[0];
IF PIa^[1] > iMax THEN
iMax := PIa^[1];
IF PIa^[2] > iMax THEN
iMax := PIa^[2];
IF PIa^[3] > iMax THEN
iMax := PIa^[3];
END ;
if ( PIa^[0] < iMin ) OR ( PIa^[1] < iMin ) OR ( PIa^[2] < iMin ) OR (
PIa^[3] < iMin ) then BEGIN
IF PIa^[0] < iMin THEN
iMin := PIa^[0];
IF PIa^[1] < iMin THEN
iMin := PIa^[1];
IF PIa^[2] < iMin THEN
iMin := PIa^[2];
IF PIa^[3] < iMin THEN
iMin := PIa^[3];
END ;
Inc(PIa);
end;
aMax := iMin;
aMin := iMax;
end;

This gave some performance improvements (loop = 10 times):
0,288898 s
0,243985 s

Then i tried 'unrolling' with an array of 8, at then it was:
0,288898 s-mine
0,243985 s-mine 4
0,236794 s-mine-8
0,289013 s - unmodified
0,289026 s - unmodified
0,246168 s - unmodified

So it seems that to a certainn point, there could be some gain in unrolling.
But this is old HW, so it could be changed.

--
Med venlig hilsen/Best regards
Stig Johansen

Pierre le Riche

unread,

Jul 6, 2008, 4:17:10 AM7/6/08

to

Hi Rudy,

> It is more than twice as slow as Stig's or my version (on my Core 2
> Quad 2.4 GHz). I am not quite sure why. Rearranging the order of the
> routines the calling code doesn't help.

Interesting. Did you test with completely random input, or monotonically
increasing or decreasing data? The latter would favour routines
containing branches.

I figured that eliminating the branches would help reduce the impact of
branch mispredictions. I guess with a large array as input the min & max
values quickly converge and the branches become highly predictable,
making them cheap.

On a CPU with a longer pipeline, like the P4, and with branch predictor
"unfriendly" input it will do better. I doubt it will be able to make up
the 2x deficit, however.

Regards,
Pierre

Hubert Seidel

unread,

Jul 6, 2008, 6:44:04 AM7/6/08

to

"Hubert Seidel" <nos...@hubert-seidel.de> schrieb im Newsbeitrag
news:g4p39k$tsp$02$1...@news.t-online.com...

> "Q Correll" <qcor...@pacNObell.net> schrieb im Newsbeitrag

...

> > Excellent!
> Than you :-)
>
> I am encouraged to write two versions in BASM. On my PIII both are only a
> litle bit faster.

sorry, there was a mistake in the last BASM-function :-(
The loop leave one time too early! So the last value not incuded!
Here ist the correction and an new third version with mnemonic "cmovl".

version 1:
==========================================================

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);
asm
push ebx
push esi
push edi

mov ebx, ecx
mov ecx, [eax]
mov edi, ecx

inc edx

test edx, edx
jle @@Output

@@LpBegin:
mov esi, [eax]

@@CheckMin:
cmp esi, edi
jnl @@CheckMax
mov edi, esi // Store in EDI if < //

@@CheckMax:
cmp ecx, esi
jnl @@LpEnd
mov ecx, esi // Store in ECX if > //

@@LpEnd:
add eax, 4
dec edx
jnz @@LpBegin

@@Output:
mov eax, [ebp+$08]
mov [ebx], ecx // Output aMin //
mov [eax], edi // Output aMax //

pop edi
pop esi
pop ebx
end;

version 2 (cmov and unrolling in basm):
==========================================================

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);
asm
push ebx
push esi
push edi
push ebp

mov ebx, ecx
mov ecx, [eax]
mov edi, ecx

inc edx

test edx, edx
jle @@Output

@@LpBeginA:

@@Output:
pop ebp

version 3:
==========================================================

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);
asm
push ebx
push esi
push edi
push ebp

mov ebx, ecx
mov ecx, [eax]
mov edi, ecx

inc edx

test edx, edx
jle @@Output

mov ebp, edx // ebp := edx mod 4
and ebp, 3
shr edx, 2 // edx := edx div 4

test edx, edx
jle @@RestMod4

@@LpMainBegin:
mov esi, [eax]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }
cmp ecx, esi
db $0F,$4C,$CE { cmovl ecx, esi }

mov esi, [eax+4]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }
cmp ecx, esi
db $0F,$4C,$CE { cmovl ecx, esi }

mov esi, [eax+8]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }
cmp ecx, esi
db $0F,$4C,$CE { cmovl ecx, esi }

mov esi, [eax+12]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }
cmp ecx, esi
db $0F,$4C,$CE { cmovl ecx, esi }

@@LpMainEnd:
add eax, 16
dec edx
jnz @@LpMainBegin

@@RestMod4:

test ebp, ebp
jle @@Output

@@LpRestBegin:
mov esi, [eax]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }
cmp ecx, esi
db $0F,$4C,$CE { cmovl ecx, esi }

@@LpRestEnd:
add eax, 4
dec ebp
jnz @@LpRestBegin

@@Output:
pop ebp

mov eax, [ebp+$08]
mov [ebx], ecx // Output aMin //
mov [eax], edi // Output aMax //

pop edi
pop esi
pop ebx
end;

have fun.

Stig Johansen

unread,

Jul 6, 2008, 8:28:21 AM7/6/08

to

"Clément Doss" <cd...@dhs.com.br> wrote in message
news:486ffdd6$1...@newsgroups.borland.com...

> And if you keep lowering the cycles, I will most certainly test.

Hmm. can't resist .... here it a dialog from the movie TRON between MCP and
Sark:

Master Control Program: I've got a little challenge for you, SARK - a new
recruit. He's a tough case, but I want him treated in the usual manner.
Train him for the games... let him hope for a while... and blow him away.

SARK: You've got it. I've been hopin' you'd send me somebody with a little
more guts... what kind of program is he?

Master Control Program: He's not any kind of program, SARK. He's a User.

SARK: A user?
Master Control Program: That's right. He pushed me... in the other world.
Somebody pushes me, I push back. So I brought him down here... What's the
matter, SARK? You look nervous.

SARK: Well, I - it's just - I don't know, a User, I mean... Users wrote us.
A User even wrote you...

Master Control Program: No one User wrote me. I'm worth a couple million of
their man-years!

SARK: But-what if I can't...?

Master Control Program: You rather take your chances with me? Want me to
*slow* *down* your power cycles for you?

SARK: Wait... I *need* that...

Master Control Program: Then pull yourself together. Get this clown trained.
I want him in the Games until he dies playing

(*'s added by me:-)

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 10:33:19 AM7/6/08

to

Pierre le Riche wrote:

> Hi Rudy,
>
> > It is more than twice as slow as Stig's or my version (on my Core 2
> > Quad 2.4 GHz). I am not quite sure why. Rearranging the order of the
> > routines the calling code doesn't help.
>
> Interesting. Did you test with completely random input, or
> monotonically increasing or decreasing data?

With monotonously increasing and with random data. No measurable
difference.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I would have made a good Pope." -- Richard Nixon.

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 10:31:15 AM7/6/08

to

Stig Johansen wrote:

> I tried changing my own code to use local in and max. There were
> virtually no difference (2 on the fourth decimal).

There was quite a difference on my system. I posted the results.

> > So loop unrolling can be faster, but apparently only under certain
> > conditions.
>
> Then i tried 'unrolling' with the following construct (A little Q&D,
> but just to see the diff):

You are using pointers again. This not only makes the code harder to
read, it is not necessary either.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Every journalist has a novel in him, which is an excellent
place for it."

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 10:36:08 AM7/6/08

to

Pierre le Riche wrote:

> Hi Rudy,
>
> > It is more than twice as slow as Stig's or my version (on my Core 2
> > Quad 2.4 GHz). I am not quite sure why. Rearranging the order of the
> > routines the calling code doesn't help.
>
> Interesting. Did you test with completely random input, or
> monotonically increasing or decreasing data? The latter would favour
> routines containing branches.
>
> I figured that eliminating the branches would help reduce the impact

> of branch mispredictions. Yes, but that means that, for min or max,
you do a comparison (IOW, a subtraction), a subtraction, an AND and an
addition in EVERY loop. The loop using branching only does the
comparison. Also the -Ord(LArrVal < LTempMin) part costs time. I guess
that is too much.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Wit makes its own welcome and levels all distinctions."
-- Ralph Waldo Emerson (1803-1882)

Stig Johansen

unread,

Jul 6, 2008, 11:11:45 AM7/6/08

to

Rudy Velthuis [TeamB] wrote:

> You are using pointers again. This not only makes the code harder to
> read, it is not necessary either.

Bad habbit, maybe, depending on the eyes that sees.
I've been using Pascal since '82-'83, so a big part of my 'Pascal life' is
used to using pointers.

But anyway, i posted a 'v5' version in .attachments

--
Best regards
Stig Johansen

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 11:16:48 AM7/6/08

to

Stig Johansen wrote:

> Rudy Velthuis [TeamB] wrote:
>
> > You are using pointers again. This not only makes the code harder to
> > read, it is not necessary either.
>
> Bad habbit, maybe, depending on the eyes that sees.
> I've been using Pascal since '82-'83, so a big part of my 'Pascal
> life' is used to using pointers.

Sure, but why use pointers when one can avoid them?

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Assassins!"
-- Arturo Toscanini (1867-1957) to his orchestra

Stig Johansen

unread,

Jul 6, 2008, 11:21:17 AM7/6/08

to

Rudy Velthuis [TeamB] wrote:

> Stig Johansen wrote:
>> Bad habbit, maybe, depending on the eyes that sees.
>> I've been using Pascal since '82-'83, so a big part of my 'Pascal
>> life' is used to using pointers.
>
> Sure, but why use pointers when one can avoid them?
>

He - good question, but: Why not use pointers when you are used to it?
To be honest, i didn't know the optimizer was that good, so i just assumed
that pointers were faster :-)

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 11:25:16 AM7/6/08

to

Stig Johansen wrote:

> Rudy Velthuis [TeamB] wrote:
>
> > Stig Johansen wrote:
> >> Bad habbit, maybe, depending on the eyes that sees.
> >> I've been using Pascal since '82-'83, so a big part of my 'Pascal
> >> life' is used to using pointers.
> >
> > Sure, but why use pointers when one can avoid them?
> >
>
> He - good question, but: Why not use pointers when you are used to it?

That is only not a problem if you are the only programmer to ever see
and use your code, and no one needs the code anymore if you, say, come
under a bus and are in a coma for 3 years. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I think there is a world market for maybe five computers."
-- Thomas Watson (1874-1956), Chairman of IBM, 1943

Hubert Seidel

unread,

Jul 6, 2008, 11:41:19 AM7/6/08

to

Here for now my last basm-version v4:

"Hubert Seidel" <nos...@hubert-seidel.de> schrieb im

Newsbeitrag news:g4q7mf$sf6$03$1...@news.t-online.com (here are v1 to v3 of
basm-versions)...

I think the RAM-throughput is the bottlenack now.
With the new function v4, i only increase the performance a little.
Background of changes:
- besser use the U- und V-Pipeline simultaneous (i hope :)
- rollup 8
- check with 2 data-types and 4 different values

procedure MinMaxArray(const aArray:array of integer; out aMax,
aMin:integer);

asm // in the hope to use a little bit better U- and V-pipeline //

push ebx
push esi
push edi

push ecx
push ebp

mov ebx, [eax]
mov edi, ebx

inc edx
test edx, edx
jle @@Output

mov ebp, edx // ebp := edx mod 4

and ebp, 7
shr edx, 3 // edx := edx div 4

test edx, edx
jle @@RestMod4

mov esi, [eax]
@@LpMainBegin:
mov ecx, [eax+4]

cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }

cmp ebx, esi
db $0F,$4C,$DE { cmovl ebx, esi }

mov esi, [eax+8]
cmp ecx, edi
db $0F,$4C,$F9 { cmovl edi, ecx }
cmp ebx, ecx
db $0F,$4C,$D9 { cmovl ebx, ecx }

mov ecx, [eax+12]

cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }

cmp ebx, esi
db $0F,$4C,$DE { cmovl ebx, esi }

mov esi, [eax+16]
cmp ecx, edi
db $0F,$4C,$F9 { cmovl edi, ecx }
cmp ebx, ecx
db $0F,$4C,$D9 { cmovl ebx, ecx }

mov ecx, [eax+20]

cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }

cmp ebx, esi
db $0F,$4C,$DE { cmovl ebx, esi }

mov esi, [eax+24]
cmp ecx, edi
db $0F,$4C,$F9 { cmovl edi, ecx }
cmp ebx, ecx
db $0F,$4C,$D9 { cmovl ebx, ecx }

mov ecx, [eax+28]
add eax, 32

cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }

cmp ebx, esi
db $0F,$4C,$DE { cmovl ebx, esi }

mov esi, [eax]
cmp ecx, edi
db $0F,$4C,$F9 { cmovl edi, ecx }
cmp ebx, ecx
db $0F,$4C,$D9 { cmovl ebx, ecx }

@@LpMainEnd:
dec edx
jnz @@LpMainBegin

@@RestMod4:

test ebp, ebp
jle @@Output

@@LpRestBegin:
mov esi, [eax]
cmp esi, edi
db $0F,$4C,$FE { cmovl edi, esi }

cmp ebx, esi
db $0F,$4C,$DE { cmovl ebx, esi }

@@LpRestEnd:
add eax, 4
dec ebp
jnz @@LpRestBegin

@@Output:
pop ebp
pop ecx

mov eax, [ebp+$08]
mov [ecx], ebx // Output aMin //

mov [eax], edi // Output aMax //

pop edi
pop esi
pop ebx
end;

=====================

t2=PIII-cpu-clocks (single-cpu with rdtsc), s=seconds

in large Arrays e.g.: array[0..50000-1] of integer;

Info: "Test 00" Array-Values all 50000 Values zero 0
--------------------------------------------------
BASM normal ..................v1.. : t2=350029, s=0,364
BASM unroll 2 ................v2.. : t2=325044, s=0,345
BASM cmovl unroll 4 ..........v3.. : t2=292043, s=0,315
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=268751, s=0,285
Mixtur ........................... : t2=250411, s=0,270
===============================================================

Info: "Test 01" Array-Values random +/- 25000
--------------------------------------------------
BASM normal ..................v1.. : t2=350316, s=0,375
BASM unroll 2 ................v2.. : t2=325308, s=0,340
BASM cmovl unroll 4 ..........v3.. : t2=291383, s=0,310
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=268767, s=0,280
Mixtur ........................... : t2=250899, s=0,264
===============================================================

Info: "Test 02" Array-Values 0,1,2,3,4,...,49999
--------------------------------------------------
BASM normal ..................v1.. : t2=299421, s=0,320
BASM unroll 2 ................v2.. : t2=269221, s=0,285
BASM cmovl unroll 4 ..........v3.. : t2=291699, s=0,320
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=268763, s=0,280
Mixtur ........................... : t2=300233, s=0,315
===============================================================

Info: "Test 03" Array-Values 0,-1,-2,-3,-4,...,-49999
--------------------------------------------------
BASM normal ..................v1.. : t2=312629, s=0,330
BASM unroll 2 ................v2.. : t2=272050, s=0,295
BASM cmovl unroll 4 ..........v3.. : t2=291355, s=0,310
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=268767, s=0,285
Mixtur ........................... : t2=275613, s=0,290
===============================================================

in small Arrays e.g.: array[0..4096-1] of integer;

Info: "Test 00" Array-Values all 50000 Values zero 0
--------------------------------------------------
BASM normal ..................v1.. : t2=24636, s=0,025
BASM unroll 2 ................v2.. : t2=20600, s=0,020
BASM cmovl unroll 4 ..........v3.. : t2=17691, s=0,020
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=16122, s=0,020
Mixtur ........................... : t2=17270, s=0,016
===============================================================

Info: "Test 01" Array-Values random +/- 25000
--------------------------------------------------
BASM normal ..................v1.. : t2=24810, s=0,025
BASM unroll 2 ................v2.. : t2=20782, s=0,025
BASM cmovl unroll 4 ..........v3.. : t2=17704, s=0,021
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=16120, s=0,015
Mixtur ........................... : t2=17590, s=0,020
===============================================================

Info: "Test 02" Array-Values 0,1,2,3,4,...,4095
--------------------------------------------------
BASM normal ..................v1.. : t2=20770, s=0,020
BASM unroll 2 ................v2.. : t2=20649, s=0,026
BASM cmovl unroll 4 ..........v3.. : t2=17676, s=0,014
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=16126, s=0,015
Mixtur ........................... : t2=18868, s=0,020
===============================================================

Info: "Test 03" Array-Values 0,-1,-2,-3,-4,...,-4095
--------------------------------------------------
BASM normal ..................v1.. : t2=17728, s=0,020
BASM unroll 2 ................v2.. : t2=16941, s=0,019
BASM cmovl unroll 4 ..........v3.. : t2=17704, s=0,021
BASM cmovl unroll 8 uv-pipe ..v4.. : t2=16120, s=0,015
Mixtur ........................... : t2=18332, s=0,022
===============================================================

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 11:39:27 AM7/6/08

to

Rudy Velthuis [TeamB] wrote:

> Stig Johansen wrote:
>
> > I tried changing my own code to use local in and max. There were
> > virtually no difference (2 on the fourth decimal).
>
> There was quite a difference on my system. I posted the results.
>
> > > So loop unrolling can be faster, but apparently only under certain
> > > conditions.
> >
> > Then i tried 'unrolling' with the following construct (A little Q&D,
> > but just to see the diff):

Your newest is fastest on my system:

Original code
0,798055 s
Using pointers
0,186170 s
Using pointers (8/loop)
0,115466 s
Using indices
0,184385 s
Reading array value once
0,181304 s

Nenad Trkulja - loop unrolling

0,261572 s

Hubert Seidel - local min and max

0,161570 s

Pierre le Riche - boolean trick

0,502879 s

Trkulja/Seidel - loop unrolling and local min and max

0,134098 s

It looks a lot like Trkulja/Seidel, but now with 8-fold loop unrolling
instead of 4. I tried 16-fold, and got values like

Using pointers (16/loop)
0,106721 s

I doubt 32 will make a lot of difference. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Why yes -- a bulletproof vest."
-- James Rodges, murderer, on his final request before the
firing squad.

Stig Johansen

unread,

Jul 6, 2008, 11:43:18 AM7/6/08

to

Rudy Velthuis [TeamB] wrote:

> Stig Johansen wrote:
>
>> I tried changing my own code to use local in and max. There were
>> virtually no difference (2 on the fourth decimal).
>
> There was quite a difference on my system. I posted the results.

BTW, do you think it's because[0] of the HW[1] or the optimizer (I use D7).

[0] Both the difference in unrolling and the diff in using local vars.
[1] It look's like i only have 64KB L1 and 256KB L2 cache.

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 11:50:37 AM7/6/08

to

Stig Johansen wrote:

ISTM that this is very much depending on the CPU and the other
hardware. I have a 2.4 GHz Core 2 Quad.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Some cause happiness wherever they go; others, whenever they go."
-- Oscar Wilde (1854-1900)

Stig Johansen

unread,

Jul 6, 2008, 12:03:57 PM7/6/08

to

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message

news:xn0fsd3vxgzt2j...@rvelthuis.de...

> Rudy Velthuis [TeamB] wrote:
> Your newest is fastest on my system:
>
> Original code
> 0,798055 s
> Using pointers
> 0,186170 s
> Using pointers (8/loop)
> 0,115466 s
> Using indices
> 0,184385 s
> Reading array value once
> 0,181304 s
> Nenad Trkulja - loop unrolling
> 0,261572 s
> Hubert Seidel - local min and max
> 0,161570 s
> Pierre le Riche - boolean trick
> 0,502879 s
> Trkulja/Seidel - loop unrolling and local min and max
> 0,134098 s

That's quite some differences there, i got this:
Original code
3,977243 s
Using pointers
3,168108 s
Using pointers (8/loop)
2,560756 s
Using indices
3,176681 s
Reading array value once
3,133010 s

Nenad Trkulja - loop unrolling

3,023329 s

Hubert Seidel - local min and max

3,166742 s

Pierre le Riche - boolean trick

3,635602 s

Trkulja/Seidel - loop unrolling and local min and max

2,682598 s

So the differences (in %) is not that big compared to yours system, or is it
Delphi?

> It looks a lot like Trkulja/Seidel, but now with 8-fold loop unrolling
> instead of 4. I tried 16-fold, and got values like
>
> Using pointers (16/loop)
> 0,106721 s
>
> I doubt 32 will make a lot of difference. <g>

Aggreed, It seems like a curve that flattens out.

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 11:56:47 AM7/6/08

to

Clément Doss wrote:

> Hi all,
>
> I really can't thank you enough for helping me with this one.
>
> I have to come up with a short term solution. And replacing the
> minmax should come first. On a second step I will improve the overall
> design. For example: While converting the doubles to integer, I can
> get the max and min. But I still must be sure that everything will
> work as expected. Once I understand better how this project works, I
> am sure I can get rid of that calls. In that part of the program,
> this routine is called 67%, followed by 58% of a matrix
> transformation routine. I just can't imagine the performance boost
> if I replace the minmax routine with the one you provided.
>
> I took the code you all posted and update Rudy's benchmark test.
> Those are the results...
>
>
> Array0 36484396 - 19 9999995
> Array1 35067716 - 19 9999995
> Array2 32072704 - 19 9999995
> Array3 18110452 - 19 9999995
> Array4 49961416 - 19 9999995
> Array5 105275380 - 19 9999995 --> Original code
> Array6 63633496 - 19 9999995
>
> Does this means that if I replace my code with the one in Array4,
> there will be a 2x performance increase? Is that it?

There will be a 2x perforamce increase for that part of your code. Of
course, you can get even better performance, but it is hard to tell
what the overall increase will be. It will very likely be less than 2x,
ISTM.

FWIW, you better profile in what part of your code the program spends
the most of the time. Since your percentages add up to more than 100%,
I assume that is the frequency of their calls, not the time you spend
in them.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"God gave men both a penis and a brain, but unfortunately not
enough blood supply to run both at the same time."
-- Robin Williams, commenting on the Clinton/Lewinsky affair

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 12:11:08 PM7/6/08

to

Stig Johansen wrote:

I don't think it is Delphi 7, but it COULD be. I don't have D7
installed here, so I can't tell. Otherwise I'd try this with D7 too.

The big difference (yours well above 2 seconds, mine well under a
second) is of course the hardware, but the percentual differences
between the routines are not nearly as dramatic in your case as in
mine. I get a 7-fold speed increase, you get some 50%. I guess this is
because of differences in processors, but I can't completely rule out
differences in optimizers between D7 and D2007 either.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"They have computers, and they may have other weapons of mass
destruction." -- Janet Reno, Us Attorney General, 02-27-98

Hubert Seidel

unread,

Jul 6, 2008, 12:36:33 PM7/6/08

to

Hello Rudy,

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schrieb im Newsbeitrag
news:xn0fsd3vxgzt2j...@rvelthuis.de...
...

> Original code
> 0,798055 s
> Using pointers
> 0,186170 s
> Using pointers (8/loop)
> 0,115466 s
> Using indices
> 0,184385 s
> Reading array value once
> 0,181304 s
> Nenad Trkulja - loop unrolling
> 0,261572 s
> Hubert Seidel - local min and max
> 0,161570 s
> Pierre le Riche - boolean trick
> 0,502879 s
> Trkulja/Seidel - loop unrolling and local min and max
> 0,134098 s
>
> It looks a lot like Trkulja/Seidel, but now with 8-fold loop unrolling
> instead of 4. I tried 16-fold, and got values like
>
> Using pointers (16/loop)
> 0,106721 s

...
Could you test the v4 ( news:g4qp3r$djm$00$1...@news.t-online.com 06.07.2008
17:41 )and post the result for comparisation?
Where can i found the code for "Using pointers (16/loop)"?

thx
Herby

--
http://www.hubert-seidel.de

Q Correll

unread,

Jul 6, 2008, 12:41:51 PM7/6/08

to

Hubert,

| Here ist the correction and an new third version with mnemonic "cmovl".

<chuckle> Got a bit caught-up with the challenge, eh? Good show!

--
Q

07/06/2008 09:41:03

Q Correll

unread,

Jul 6, 2008, 12:45:08 PM7/6/08

to

Stig,

| To be honest, i didn't know the optimizer was that good, so i just assumed
| that pointers were faster :-)

Ditto. <g>

--
Q

07/06/2008 09:44:46

Q Correll

unread,

Jul 6, 2008, 12:54:16 PM7/6/08

to

Stig,

| Hmm. can't resist .... here it a dialog from the movie TRON between
| MCP and Sark:

Never saw the movie but I get the drift. <g>

--
Q

07/06/2008 09:53:36

Q Correll

unread,

Jul 6, 2008, 12:51:32 PM7/6/08

to

Rudy,

| Not anymore. A combination of Trkulja and Seidel is the fastest now.

Yep,... just read through the thread and saw that too. Good "stuff!"

--
Q

07/06/2008 09:50:47

Hubert Seidel

unread,

Jul 6, 2008, 1:24:21 PM7/6/08

to

Hi Q,

"Q Correll" <qcor...@pacNObell.net> schrieb im Newsbeitrag

news:4870f5cf$1...@newsgroups.borland.com...

> Hubert,
> | Here ist the correction and an new third version with mnemonic "cmovl".
>
> <chuckle> Got a bit caught-up with the challenge, eh? Good show!

needs must :-)) It is an interesting matter. Actually i hoped to see more
basm :)
Also i am interested to the test-results with "cmovl" v4 with different
hardware.

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 1:27:06 PM7/6/08

to

Hubert Seidel wrote:

> Could you test the v4 ( news:g4qp3r$djm$00$1...@news.t-online.com
> 06.07.2008 17:41 )and post the result for comparisation?

Such links tend to start Windows Mail for me, so I don't click them.
I'll try to find it using XanaNews.

> Where can i found the code for "Using pointers (16/loop)"?

I just expanded the 8/loop code to do 16 items at once, instead of 8.
See attachments for the 8/loop code.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"A good pun is its own reword."

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 1:38:19 PM7/6/08

to

Hubert Seidel wrote:

> Could you test the v4 ( news:g4qp3r$djm$00$1...@news.t-online.com
> 06.07.2008 17:41 )and post the result for comparisation?

I'll post the latest to .attachments (MinMaxTest6.dpr)

Results for A[I] := I:

---------------------------------
Clément Doss - original code
0,565642 s

Stig Johansen - using pointers
0,186530 s

Stig Johansen - using pointers (16/loop)
0,110689 s

Rudy Velthuis - using indices
0,191127 s

Rudy Velthuis - reading array value once
0,184219 s

Nenad Trkulja - loop unrolling (4/loop)
0,297267 s

Hubert Seidel - local min and max

0,158404 s

Pierre le Riche - boolean trick

0,507377 s

Trkulja/Seidel - loop unrolling and local min and max

0,131369 s

Hubert Seidel - BASM v4
0,162332 s
---------------------------------

Results for A[I] := Random (10 * 1000 * 1000):

---------------------------------
Clément Doss - original code
0,830924 s

Stig Johansen - using pointers
0,182923 s

Stig Johansen - using pointers (16/loop)
0,108432 s

Rudy Velthuis - using indices
0,186078 s

Rudy Velthuis - reading array value once
0,307773 s

Nenad Trkulja - loop unrolling (4/loop)
0,198671 s

Hubert Seidel - local min and max

0,137631 s

Pierre le Riche - boolean trick

0,505107 s

Trkulja/Seidel - loop unrolling and local min and max

0,118472 s

Hubert Seidel - BASM v4
0,167197 s
---------------------------------

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Subtlety is the art of saying what you think and getting out of
the way before it is understood." -- Unknown

Hubert Seidel

unread,

Jul 6, 2008, 2:13:53 PM7/6/08

to

Hi Rudy,

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schrieb im Newsbeitrag

news:xn0fsd70ph41xo...@rvelthuis.de...

> Hubert Seidel wrote:
> > Could you test the v4 ( news:g4qp3r$djm$00$1...@news.t-online.com
> > 06.07.2008 17:41 )and post the result for comparisation?
> I'll post the latest to .attachments (MinMaxTest6.dpr)

I can't seeattachments :(

> Results for A[I] := I:

...
Thank you very much!
..

> Stig Johansen - using pointers (16/loop)
> 0,110689 s

...
and
...

> Results for A[I] := Random (10 * 1000 * 1000):

...

> Trkulja/Seidel - loop unrolling and local min and max
> 0,118472 s

..
are nearby...

> Hubert Seidel - BASM v4
> 0,167197 s

thx.please remove this test case in the future :(
I don't know why cmovcc is not so fast or faster?!?!
And why the U/V-pipline-changes has not any effect ):

I'm curious about the next acceleration (but i think, here is the end :).

Q Correll

unread,

Jul 6, 2008, 2:15:29 PM7/6/08

to

Hubert,

| With the new function v4, i only increase the performance a little.

"A little" being from ~9% to 12.5%. That's still non-trivial in my mind.

I think you're probably correct about ram access perhaps being the
limiting factor. The "front-side" bus speed may also be the reason why
Rudy doesn't see much difference in the performance of some of the various
approaches.

I don't know how to assure that the array is always in the on-die cpu
cache. Which might "normalize" the testing to "apples and apples."

--
Q

07/06/2008 11:09:33

Hubert Seidel

unread,

Jul 6, 2008, 3:25:25 PM7/6/08

to

Hello Q,

"Q Correll" <qcor...@pacNObell.net> schrieb im Newsbeitrag

news:48710bc1$1...@newsgroups.borland.com...

> Hubert,
> | With the new function v4, i only increase the performance a little.
> "A little" being from ~9% to 12.5%. That's still non-trivial in my mind.

It was interesting and funny challenge with a nice result for us.
...

> I don't know how to assure that the array is always in the on-die cpu
> cache. Which might "normalize" the testing to "apples and apples."

I've learnt to use unrolling and how work the pipeline :-)
The former was very new for me. What more would you wand?

mfg.
Herby

P.S:
(I used tesst-loop about 500 times and an arraysize of 50000, PIII 500)

--
http://www.hubert-seidel.de

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 4:50:11 PM7/6/08

to

Hubert Seidel wrote:

> Hi Rudy,
>
> "Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schrieb im
> Newsbeitrag news:xn0fsd70ph41xo...@rvelthuis.de...
> > Hubert Seidel wrote:
> > > Could you test the v4 ( news:g4qp3r$djm$00$1...@news.t-online.com
> > > 06.07.2008 17:41 )and post the result for comparisation?
> > I'll post the latest to .attachments (MinMaxTest6.dpr)
>
> I can't seeattachments :(

Huh?

--
Rudy Velthuis [TeamB] http://www.teamb.com

"We must all hear the universal call to like your neighbor like
you like to be liked yourself." -- George W. Bush

W. van Wezel

unread,

Jul 6, 2008, 6:04:12 PM7/6/08

to

Hi all,

>In this project, I need to find the max and min integer from an array.
>I came up with this code. It seems faster than the others I tried on.

Would threading help? I've uploaded a concept version to the attachments
group.

Wout

Rudy Velthuis [TeamB]

unread,

Jul 6, 2008, 6:33:18 PM7/6/08

to

W. van Wezel wrote:

Impressive numbers on my quad core. Not sure how well it does on a
single or dual core, or hyperthreading processor:

Original code (Clément)
0,552489 s min: 782 max: 999999032

Using threads (Wout)
0,050376 s min: 782 max: 999999032

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Marry me and I'll never look at another horse!"
-- Groucho Marx

Clément Doss

unread,

Jul 7, 2008, 9:57:35 AM7/7/08

to

Outch!

Original code
1,185405 s min: 451 max: 999999880

Using threads
0,169088 s min: 451 max: 999999880

Dual Xeon Processor.

Amazing! Although it will take some time for me to digest your routine.

Congratulations! :D

Q Correll

unread,

Jul 7, 2008, 12:25:49 PM7/7/08

to

Hubert,

| Also i am interested to the test-results with "cmovl" v4 with different
| hardware.

IF I get some time I'll try to play with it. Right now I'm trying to get
several snakes back in the bag.

--
Q

07/07/2008 09:24:58

Q Correll

unread,

Jul 7, 2008, 12:24:39 PM7/7/08

to

Hubert,

| It was interesting and funny challenge with a nice result for us.

Yes. And I learned a bit, even though I haven't done any "real" asm
coding in years. <g>

--
Q

07/07/2008 09:23:47

Q Correll

unread,

Jul 7, 2008, 12:30:11 PM7/7/08

to

Rudy,

Thanks!

Interesting that Stig's pointers is quicker than Herby's basm. (No, I've
not looked at executable code. <g>)

--
Q

07/07/2008 09:28:01

Q Correll

unread,

Jul 7, 2008, 12:27:49 PM7/7/08

to

Hubert,

| I can't seeattachments :(

Rudy means the borland.public.attachments group.

--
Q

07/07/2008 09:26:59

Q Correll

unread,

Jul 7, 2008, 12:52:33 PM7/7/08

to

Rudy,

| Impressive numbers on my quad core. Not sure how well it does on a
| single or dual core, or hyperthreading processor:

See attachments.

--
Q

07/07/2008 09:52:20

Rudy Velthuis [TeamB]

unread,

Jul 7, 2008, 1:28:16 PM7/7/08

to

Q Correll wrote:

> Rudy,
>
> Thanks!
>
> Interesting that Stig's pointers is quicker than Herby's basm. (No,
> I've not looked at executable code. <g>)

Stig's pointers do what the Delphi compiler does too. Like I said, they
produce the EXACT same machine code.

I left his much faster loop unrolling sample as it is (it also uses
pointers), but I'm sure it could be coded without them and without the
casting trick and Delphi would also produce the same code. IOW, I don't
think that was so fast because of pointers, but because of loop
unrolling and the use of local variables for min and max.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"If you believe in telekinesis, raise my hand." -- Unknown

John Herbster

unread,

Jul 7, 2008, 2:08:08 PM7/7/08

to

>> I need the fastest routine.

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote
> I doubt it! ...

However, sometimes, maybe even often, when an over-simplified
problem is presented here, it can lead to discussions and
discoveries that transcend the original problem. --JohnH

Dan Bartlett

unread,

Jul 7, 2008, 3:23:06 PM7/7/08

to

"W. van Wezel" <hu...@informatch.com> wrote in message
news:4871...@newsgroups.borland.com...

Hi,

There appears to be a bug in this, the (aMax, aMin) parameters are never set
in MinMaxArrayThreads, so it reports the same results as found with the
original method. If you add:

aMin:=iMin;
aMax:=iMax;
to the end of MinMaxArrayThreads, the results are different to the original
method.

Dan.

Hubert Seidel

unread,

Jul 7, 2008, 3:52:19 PM7/7/08

to

High all,

"W. van Wezel" <hu...@informatch.com> schrieb im Newsbeitrag
news:4871...@newsgroups.borland.com...

> Would threading help? I've uploaded a concept version to the attachments
> group.

Now, i ask in the (german) Assembler-Newsgroup (
news:g4trtm$l7u$03$1...@news.t-online.com :-)

mfg.
Herby

P.S:
because english/germen no crossposting

--
http://www.hubert-seidel.de

Hubert Seidel

unread,

Jul 7, 2008, 3:54:55 PM7/7/08

to

Hi Q,

"Q Correll" <qcor...@pacNObell.net> schrieb im Newsbeitrag

news:48724405$1...@newsgroups.borland.com...

> Hubert,
> | I can't seeattachments :(
>
> Rudy means the borland.public.attachments group.

Ok, thanks a lot!

Dan Bartlett

unread,

Jul 7, 2008, 5:17:07 PM7/7/08

to

"Dan Bartlett" <danba...@users.sourceforge.net> wrote in message
news:48726d18$1...@newsgroups.borland.com...

> the results are different to the original method.
>

This seems to be a problem with the set-up code, all threads are just
sorting the first section of the array, since msgs[...].P :=@A[0].
I added an extra line to the set-up code in MinMaxArrayThreads, and it
appears to be returning the correct results now:

for I:=0 to numThreads-1 do
begin
if I=0
then msgs[I].lowArray:=0
else msgs[I].lowArray:=msgs[I-1].HighArray+1;
if I=numThreads-1
then msgs[I].highArray:=high(aArray)
else msgs[I].highArray:=msgs[I].lowArray+(high(aArray)) div numThreads;
msgs[i].P:=@aArray[msgs[i].lowArray]; //<====ADDED THIS LINE
end;

although a simpler function for finding min & max works better on my PC
(single core):

procedure MinMaxArray2( const aArray : Array of Integer; out aMax, aMin :
integer );
var
i : integer;
begin
aMax :=MinInt;
aMin :=MaxInt;
for i := low (aArray) to high( aArray ) do
begin
if aArray[i]>aMax then aMax:=aArray[i];
if aArray[i]<aMin then aMin:=aArray[i];
end;
end;

Dan.

Q Correll

unread,

Jul 7, 2008, 5:18:31 PM7/7/08

to

Rudy,

| IOW, I don't
| think that was so fast because of pointers, but because of loop
| unrolling and the use of local variables for min and max.

Makes sense to me.

--
Q

07/07/2008 14:18:22

Q Correll

unread,

Jul 7, 2008, 5:28:57 PM7/7/08

to

John,

| However, sometimes, maybe even often, when an over-simplified
| problem is presented here, it can lead to discussions and
| discoveries that transcend the original problem.

Yep. Even often. <g>

--
Q

07/07/2008 14:28:45

W. van Wezel

unread,

Jul 7, 2008, 5:26:03 PM7/7/08

to

Oeps,

You're right. Also:
for I:=1 to numThreads-1 do if msgs[I].tMax<iMax then iMax:=msgs[I].tMax;
should be
for I:=1 to numThreads-1 do if msgs[I].tMax>iMax then iMax:=msgs[I].tMax;

> although a simpler function for finding min & max works better on my PC
> (single core):

Yes, because you can't compensate for the overhead of threading. What
happens if you set numThreads to 1?

Wout

"Dan Bartlett" <danba...@users.sourceforge.net> wrote in message

news:487287cd$1...@newsgroups.borland.com...

Mike

unread,

Jul 7, 2008, 5:46:20 PM7/7/08

to

Ok.
After all changes -
Using 1 Thread (numThread=1)

Original code
0,737773 s min: 2884 max: 999999204
Using threads 1 attempt
0,136144 s min: 2884 max: 999999204
Using threads 2 attempt
0,138472 s min: 2884 max: 999999204
Using threads 3 attempt
0,136344 s min: 2884 max: 999999204
Using threads 4 attempt
0,132621 s min: 2884 max: 999999204

Using 2 Thread(numThreand=2)

Original code
0,745477 s min: 430 max: 999999336
Using threads 1 attempt
0,092165 s min: 430 max: 999999336
Using threads 2 attempt
0,090762 s min: 430 max: 999999336
Using threads 3 attempt
0,088455 s min: 430 max: 999999336
Using threads 4 attempt
0,088566 s min: 430 max: 999999336

Attempt 1,2,3,4 - just copy -paste using thread code.

Writeln('Using threads 2 attempt ');
Min:=0;
Max:=0;
QueryPerformanceStart;
for I := 1 to Times do
MinMaxArrayThreads(A, Max, Min);
Time := QueryPerformanceStop;
Writeln(Format('%10.6f s min: %d max: %d', [Time,Min,Max]));
Writeln;

Question - difference between threaded (numThread=2) and nonthreaded
(numThread=1) code about 0,05 s or about 70% - then why original code so
slow?

"W. van Wezel" <hu...@informatch.com> ???????/???????? ? ???????? ?????????:
news:48728a15$1...@newsgroups.borland.com...

Mike

unread,

Jul 7, 2008, 5:58:45 PM7/7/08

to

Ok.
With MinMaxArray2 function:
1 thread
Original code
0,494610 s min: 785 max: 999997475
MinMaxArray2 function
0,169332 s min: 785 max: 999997475
Using threads 1 attempt
0,133744 s min: 785 max: 999997475
Using threads 2 attempt
0,132757 s min: 785 max: 999997475
Using threads 3 attempt
0,133058 s min: 785 max: 999997475
Using threads 4 attempt
0,133705 s min: 785 max: 999997475

2 thread
Original code
0,454686 s min: 188 max: 999999257
MinMaxArray2 function
0,173825 s min: 188 max: 999999257
Using threads 1 attempt
0,094061 s min: 188 max: 999999257
Using threads 2 attempt
0,095713 s min: 188 max: 999999257
Using threads 3 attempt
0,092681 s min: 188 max: 999999257
Using threads 4 attempt
0,091880 s min: 188 max: 999999257

Why changes time of original code - don't know - i only insert MinMaxArray2
function.
My machine Intel Core 2 Duo E2140 @ 2.8Ghz processor with 4 gig of ram.

Dan Bartlett

unread,

Jul 7, 2008, 6:10:35 PM7/7/08

to

"W. van Wezel" <hu...@informatch.com> wrote in message

news:48728a15$1...@newsgroups.borland.com...

> Oeps,
>
> You're right. Also:
> for I:=1 to numThreads-1 do if msgs[I].tMax<iMax then iMax:=msgs[I].tMax;
> should be
> for I:=1 to numThreads-1 do if msgs[I].tMax>iMax then iMax:=msgs[I].tMax;
>

ah yeah, spotted that one too, but forgot about it :P

>> although a simpler function for finding min & max works better on my PC
>> (single core):
>
> Yes, because you can't compensate for the overhead of threading. What
> happens if you set numThreads to 1?
>
> Wout
>

I had to test them outside IDE, because it was having a large effect on the
result when a lot of threads were being used.

Original code: 0.774 seconds
Simple method: 0.372 seconds

1 Threads - 0.381 (0.390 inside IDE)
2 Threads - 0.382 (0.417 inside IDE)
3 Threads - 0.383
4 Threads - 0.385 (0.438 inside IDE)
8 threads - 0.385
16 threads - 0.387
128 threads - 0.44 (2.828 inside IDE)
256 threads - 0.503
1024 thread - 0.850

Dan.

W. van Wezel

unread,

Jul 7, 2008, 6:22:39 PM7/7/08

to

I've uploaded a fixed version. Also, I (hopefully correctly) changed the
"Stig Johansen - using pointers (16/loop)" procedure to be able to use it in
threads. Furthermore, the procedures now run 500 times instead of 100 to
enable a somewhat more accurate timing.

Can someone try it on an octacore?

Wout

"Dan Bartlett" <danba...@users.sourceforge.net> wrote in message

news:4872...@newsgroups.borland.com...

Rudy Velthuis [TeamB]

unread,

Jul 7, 2008, 7:13:19 PM7/7/08

to

W. van Wezel wrote:

> I've uploaded a fixed version. Also, I (hopefully correctly) changed
> the "Stig Johansen - using pointers (16/loop)" procedure to be able
> to use it in threads. Furthermore, the procedures now run 500 times
> instead of 100 to enable a somewhat more accurate timing.

Original code
3,968302 s min: 124 max: 999999970

Using threads / Hubert Seidel - local min and max
0,215553 s min: 124 max: 999999970

Using threads / Stig Johansen - using pointers (16/loop)
0,168581 s min: 124 max: 999999970

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Everything that can be invented has been invented."
-- Charles H. Duell, Commissioner, U.S. Office of Patents, 1899

Q Correll

unread,

Jul 7, 2008, 7:52:25 PM7/7/08

to

Rudy,

3.4 GHz P4 Hyperthreading:

Original code
5.443988 s min: 599 max: 999997361

Using threads / Hubert Seidel - local min and max

0.936184 s min: 599 max: 999997361

Using threads / Stig Johansen - using pointers (16/loop)

0.559328 s min: 599 max: 999997361Original code

2.2 GHz Core2/Duo:

Original code
4.229453 s min: 747 max: 999999815

Using threads / Hubert Seidel - local min and max

0.392567 s min: 747 max: 999999815

Using threads / Stig Johansen - using pointers (16/loop)

0.288125 s min: 747 max: 999999815

Both run from a Command Box using the .exe.

--
Q

07/07/2008 16:52:15

Hubert Seidel

unread,

Jul 8, 2008, 4:30:28 AM7/8/08

to

Hello Dan,

"Dan Bartlett" <danba...@users.sourceforge.net> schrieb im Newsbeitrag
news:4872...@newsgroups.borland.com...

> 1 Threads - 0.381 (0.390 inside IDE)
> 2 Threads - 0.382 (0.417 inside IDE)
> 3 Threads - 0.383
> 4 Threads - 0.385 (0.438 inside IDE)
> 8 threads - 0.385
> 16 threads - 0.387
> 128 threads - 0.44 (2.828 inside IDE)
> 256 threads - 0.503
> 1024 thread - 0.850

More threads as CPUs has a negative effect through the overhead.
Iv you have Core 4, than better use 4 threads (on each one).

I think this has a positiv effect not until from a specific array-size.

(e.g. > 10000).
if (sizeof(aArray)<10000) // (I think is better as my english ;-)
then Fast4SmalArraysWithoutThreads(xyz)
else Fast4LargeArraysWithThreads(xyz);

Rudy Velthuis [TeamB]

unread,

Jul 8, 2008, 7:39:26 AM7/8/08

to

Q Correll wrote:

> Both run from a Command Box using the .exe.

Same here.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"The nice thing about egotists is that they don't talk about
other people." -- Lucille S. Harper

Dennis

unread,

Jul 8, 2008, 1:58:09 PM7/8/08

to

Hi

> And why the U/V-pipline-changes has not any effect ):

The Pentium had U/V pipes. A PPRO, P3 etc has not. All modern processors
(except Atom) have many pipes and an out of order core. No "pipeline"
optimizations in SW will be effective, because the cores does them for you.

Best regards
Dennis Kjaer Christensen

Dennis

unread,

Jul 8, 2008, 2:00:20 PM7/8/08

to

Hi

All the Delphi compilers produce nearly the same code. CPU's are very
different. I can run benchmarks on a number of CPU's for you, including
Phenom and Penryn.