20 views

Skip to first unread message

Oct 28, 2021, 8:47:54 PM10/28/21

to

I have a data structure in one of my applications which is an array of 32 bytes.

So basically 256 bits.

The code needs to check if all bits are empty or if all bits are full. So basically all clear or all set.

Currently this data structure uses an additional counter of 1 byte to smartly keep track of when a bit is set or cleared individually and it acts upon it's conclusion =D

(secret code =D)

So basically this code uses 1 branch to check the counter, and sometimes it does a little decrementing and sometimes a little incrementing.

So the code itself is pretty efficient.

There is ofcourse a slight little "problem" / "inefficiency" with this technique it consumes 1 additional byte.

So 1 out of 33 bytes. Maybe there are other implicit penalties where this 33 data structure is misaligned because of this 33 byte structure.

So reducing this structure to just 32 bytes could not only safe memory but it might also increase performance a little bit, to maybe a lot ?

So at least memory wise this would give a 3% reduction of memory consumption.

Performance wise I don't know.

Currently checking 256 bits with code with be much worse.

Possibilities:

1. Check each bit individually, this would be pretty stupid and slow.

2. Check each byte, so 32 comparisons, still pretty slow.

3. Check each word, so 16 comparisons, a bit better.

4. Check each longword, so 8 comparisons, a lot better but still slow.

5. Check each quadword, so 4 comparisons, only applicable in 64 bit application.

On some systems 64 bit integer compare can still take two clock cycles.

Is there something better and faster ?

Is there a 128 bit compare in a single clock or two ?

Is there a 256 bit compare in a single clock or two ?

Just wondering ! ;)

Anyway in case such an instruction would exist for example:

if Are256BitsClear( DataStructure )

if Are256BitsSet( DataStructure )

then the counter could most likely/ofcourse be removed =D

and the data structure would be a nice 32 byte fit and a bit more efficient.

Perhaps it might even make sense to use 4 comparisons if the alignment somehow gives it more performance then the overhead of 4 comparisons lol.

Bye,

Skybuck.

So basically 256 bits.

The code needs to check if all bits are empty or if all bits are full. So basically all clear or all set.

Currently this data structure uses an additional counter of 1 byte to smartly keep track of when a bit is set or cleared individually and it acts upon it's conclusion =D

(secret code =D)

So basically this code uses 1 branch to check the counter, and sometimes it does a little decrementing and sometimes a little incrementing.

So the code itself is pretty efficient.

There is ofcourse a slight little "problem" / "inefficiency" with this technique it consumes 1 additional byte.

So 1 out of 33 bytes. Maybe there are other implicit penalties where this 33 data structure is misaligned because of this 33 byte structure.

So reducing this structure to just 32 bytes could not only safe memory but it might also increase performance a little bit, to maybe a lot ?

So at least memory wise this would give a 3% reduction of memory consumption.

Performance wise I don't know.

Currently checking 256 bits with code with be much worse.

Possibilities:

1. Check each bit individually, this would be pretty stupid and slow.

2. Check each byte, so 32 comparisons, still pretty slow.

3. Check each word, so 16 comparisons, a bit better.

4. Check each longword, so 8 comparisons, a lot better but still slow.

5. Check each quadword, so 4 comparisons, only applicable in 64 bit application.

On some systems 64 bit integer compare can still take two clock cycles.

Is there something better and faster ?

Is there a 128 bit compare in a single clock or two ?

Is there a 256 bit compare in a single clock or two ?

Just wondering ! ;)

Anyway in case such an instruction would exist for example:

if Are256BitsClear( DataStructure )

if Are256BitsSet( DataStructure )

then the counter could most likely/ofcourse be removed =D

and the data structure would be a nice 32 byte fit and a bit more efficient.

Perhaps it might even make sense to use 4 comparisons if the alignment somehow gives it more performance then the overhead of 4 comparisons lol.

Bye,

Skybuck.

Oct 28, 2021, 9:01:53 PM10/28/21

to

I was just scanning the internet/googling it and I came across this, which is a pretty neat trick to start with lol:

https://github.com/holiman/uint256/blob/master/uint256.go

Particularly this code, written in go, I have never programmed in go but I can still read/understand some of it:

// Cmp compares z and x and returns:

//

// -1 if z < x

// 0 if z == x

// +1 if z > x

//

func (z *Int) Cmp(x *Int) (r int) {

// z < x <=> z - x < 0 i.e. when subtraction overflows.

d0, carry := bits.Sub64(z[0], x[0], 0)

d1, carry := bits.Sub64(z[1], x[1], carry)

d2, carry := bits.Sub64(z[2], x[2], carry)

d3, carry := bits.Sub64(z[3], x[3], carry)

if carry == 1 {

return -1

}

if d0|d1|d2|d3 == 0 {

return 0

}

return 1

}

The interesting part is in:

if d0|d1|d2|d3 == 0 {

Apperently what this code does is it retrieves 4x64 bit quantities.

OR-s them together.

And then checks the result of that with a single branch.

So at the assembler/instruction/compiler generated code level this would probably result in something like:

mov register1, 64 bits from array[0]

mov register2, 64 bits from array[1]

mov register3, 64 bits from array[2]

mov register4, 64 bits from array[3]

or register1, register2

or register1, register3

or register1, register4

jump if zero/jump if not zero.

Something like that, so this could be pretty efficient ! at least "branch times" are avoided/costly comparisons.

The jump zero flag instruction could also be pretty efficient, even more efficient than a cmp, because the or instruction may already have set it.

The only slight downside to this would be 7 extra instructions taking up some instruction cache.

All in all not to bad for a first google ! ;) This might be quite quick actually ! LOL and might be worth it.

And the nicest part about it is it doesn't require any special assembler or avx or mmx or sse instructions... hmmmm...

(And ofcourse the compiler has to be slightly efficient at generating this exact assembler code, but it can probably do that ! ;) maybe it has even some more tricks

up it's sleeve.)

Can you do better than this though ? Wondering...

Bye,

Skybuck.

https://github.com/holiman/uint256/blob/master/uint256.go

Particularly this code, written in go, I have never programmed in go but I can still read/understand some of it:

// Cmp compares z and x and returns:

//

// -1 if z < x

// 0 if z == x

// +1 if z > x

//

func (z *Int) Cmp(x *Int) (r int) {

// z < x <=> z - x < 0 i.e. when subtraction overflows.

d0, carry := bits.Sub64(z[0], x[0], 0)

d1, carry := bits.Sub64(z[1], x[1], carry)

d2, carry := bits.Sub64(z[2], x[2], carry)

d3, carry := bits.Sub64(z[3], x[3], carry)

if carry == 1 {

return -1

}

if d0|d1|d2|d3 == 0 {

return 0

}

return 1

}

The interesting part is in:

if d0|d1|d2|d3 == 0 {

Apperently what this code does is it retrieves 4x64 bit quantities.

OR-s them together.

And then checks the result of that with a single branch.

So at the assembler/instruction/compiler generated code level this would probably result in something like:

mov register1, 64 bits from array[0]

mov register2, 64 bits from array[1]

mov register3, 64 bits from array[2]

mov register4, 64 bits from array[3]

or register1, register2

or register1, register3

or register1, register4

jump if zero/jump if not zero.

Something like that, so this could be pretty efficient ! at least "branch times" are avoided/costly comparisons.

The jump zero flag instruction could also be pretty efficient, even more efficient than a cmp, because the or instruction may already have set it.

The only slight downside to this would be 7 extra instructions taking up some instruction cache.

All in all not to bad for a first google ! ;) This might be quite quick actually ! LOL and might be worth it.

And the nicest part about it is it doesn't require any special assembler or avx or mmx or sse instructions... hmmmm...

(And ofcourse the compiler has to be slightly efficient at generating this exact assembler code, but it can probably do that ! ;) maybe it has even some more tricks

up it's sleeve.)

Can you do better than this though ? Wondering...

Bye,

Skybuck.

Oct 28, 2021, 9:12:22 PM10/28/21

to

This would make a fine implementation for Are256BitsClear.

This leaves Are256BitsSet still to be implemented...

Hmm can the same trick be done for AllSet ?

Probably with AND and checking for FF also decimally known as 255 :)

So for Are256BitsSet this code may work:

mov register1, 64 bits from array[0]

mov register2, 64 bits from array[1]

mov register3, 64 bits from array[2]

mov register4, 64 bits from array[3]

and register1, register2

and register1, register3

and register1, register4

jump if zero/jump if not zero.

^ Not sure about the jump if not zero instruction... this would probably be invalid.

It needs to be 255 or nothing at all.

My initially thought was to maybe negate it or something, turn 255 specific to zero somehow...

Maybe subtract 255 from it to see if it's zero... if not then it's not full.

Hmmm ?

Bye for now,

Skybuck.

This leaves Are256BitsSet still to be implemented...

Hmm can the same trick be done for AllSet ?

Probably with AND and checking for FF also decimally known as 255 :)

So for Are256BitsSet this code may work:

mov register1, 64 bits from array[0]

mov register2, 64 bits from array[1]

mov register3, 64 bits from array[2]

mov register4, 64 bits from array[3]

and register1, register3

and register1, register4

jump if zero/jump if not zero.

It needs to be 255 or nothing at all.

My initially thought was to maybe negate it or something, turn 255 specific to zero somehow...

Maybe subtract 255 from it to see if it's zero... if not then it's not full.

Hmmm ?

Bye for now,

Skybuck.

Oct 28, 2021, 10:18:17 PM10/28/21

to

I found it fun to scan the internet to see if there is anybody that has/needs these same methods.

I came across unity documentation:

https://docs.unity3d.com/Packages/com.unity.render-pipelines.core@7.2/api/UnityEngine.Rendering.BitArray256.html

Describing

allTrue

True if all bits are 1.

Declaration

public bool allTrue { get; }

and it took a while to find the exact source code but it's here:

https://github.com/needle-mirror/com.unity.render-pipelines.core/blob/master/Runtime/Utilities/BitArray.cs

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

from this class to give it a bit more context:

/// <summary>

/// Bit array of size 256.

/// </summary>

[Serializable]

[System.Diagnostics.DebuggerDisplay("{this.GetType().Name} {humanizedData}")]

public struct BitArray256 : IBitArray

{

[SerializeField]

ulong data1;

[SerializeField]

ulong data2;

[SerializeField]

ulong data3;

[SerializeField]

ulong data4;

/// <summary>Number of elements in the bit array.</summary>

public uint capacity => 256u;

/// <summary>True if all bits are 0.</summary>

public bool allFalse => data1 == 0uL && data2 == 0uL && data3 == 0uL && data4 == 0uL;

/// <summary>True if all bits are 1.</summary>

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

/// <summary>Returns the bit array in a human readable form.</summary>

So let's go back to this code:

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

It's a bit funny to see both solutions for allFalse and AllTrue the naming is nice, in my case you could think of it as AllEmpty and AllFull hint hint about secret code lol.

Anyway

At least the allFalse code could be written a bit more efficient HAHA ! =D

as already explored.

But can the allTrue code be written more efficient as well ? I think so...

At least the and's could be done on the data itself

(data1 && data2 && data3 && data4) == ulong.MaxValue

3's and's

1 comparison to max value.

Bye for now,

Skybuck.

I came across unity documentation:

https://docs.unity3d.com/Packages/com.unity.render-pipelines.core@7.2/api/UnityEngine.Rendering.BitArray256.html

Describing

allTrue

True if all bits are 1.

Declaration

public bool allTrue { get; }

and it took a while to find the exact source code but it's here:

https://github.com/needle-mirror/com.unity.render-pipelines.core/blob/master/Runtime/Utilities/BitArray.cs

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

from this class to give it a bit more context:

/// <summary>

/// Bit array of size 256.

/// </summary>

[Serializable]

[System.Diagnostics.DebuggerDisplay("{this.GetType().Name} {humanizedData}")]

public struct BitArray256 : IBitArray

{

[SerializeField]

ulong data1;

[SerializeField]

ulong data2;

[SerializeField]

ulong data3;

[SerializeField]

ulong data4;

/// <summary>Number of elements in the bit array.</summary>

public uint capacity => 256u;

/// <summary>True if all bits are 0.</summary>

public bool allFalse => data1 == 0uL && data2 == 0uL && data3 == 0uL && data4 == 0uL;

/// <summary>True if all bits are 1.</summary>

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

/// <summary>Returns the bit array in a human readable form.</summary>

So let's go back to this code:

public bool allTrue => data1 == ulong.MaxValue && data2 == ulong.MaxValue && data3 == ulong.MaxValue && data4 == ulong.MaxValue;

It's a bit funny to see both solutions for allFalse and AllTrue the naming is nice, in my case you could think of it as AllEmpty and AllFull hint hint about secret code lol.

Anyway

At least the allFalse code could be written a bit more efficient HAHA ! =D

as already explored.

But can the allTrue code be written more efficient as well ? I think so...

At least the and's could be done on the data itself

(data1 && data2 && data3 && data4) == ulong.MaxValue

3's and's

1 comparison to max value.

Bye for now,

Skybuck.

Oct 29, 2021, 5:29:37 AM10/29/21

to

On Thu, 28 Oct 2021 19:18:16 -0700 (PDT)

skybuck2000 <skybu...@hotmail.com> wrote:

[] not asm code, but still something easily done:

mov ecx,NumElements

mov esi,Array-8 ; 8 bytes=256 bits

CheckElement:

add esi,8 ; next array element

mov eax,word [esi]

or eax,word [esi+4] ;; use 'and' for all set,

; if you want both tests use an additional register!

jz Found

loop CheckElement

;;

; prt none found

ret

Found:

; prt allzeros found at [esi]

ret

;;test data

; 0123456789ABCDEF0123456789ABCDEF

Array:

dd 0

dd 0 ; all zero case

dd 1

dd 0 ; not all zero case

dd -1

dd -1 ; all set

--

Bah, and indeed Humbug.

skybuck2000 <skybu...@hotmail.com> wrote:

[] not asm code, but still something easily done:

mov ecx,NumElements

mov esi,Array-8 ; 8 bytes=256 bits

CheckElement:

add esi,8 ; next array element

mov eax,word [esi]

or eax,word [esi+4] ;; use 'and' for all set,

; if you want both tests use an additional register!

jz Found

loop CheckElement

;;

; prt none found

ret

Found:

; prt allzeros found at [esi]

ret

;;test data

; 0123456789ABCDEF0123456789ABCDEF

Array:

dd 0

dd 0 ; all zero case

dd 1

dd 0 ; not all zero case

dd -1

dd -1 ; all set

--

Bah, and indeed Humbug.

Oct 29, 2021, 8:07:01 AM10/29/21

to

I have created a "professional" github page, so you guys can download the example source code properly and quickly using GIT and don't have to hassle with newsgroup crappy messaging formats:

TData and TLink optimization project

https://github.com/SkybuckFlying/TDataTLinkOptimizationProject

The aim of the project is to:

Optimize the IsEmpty and IsFull routines for TData and TLink.

Optimize the Empty and Full routines for TData and TLink.

Implement WIN64 platform support by implementing BitSet and BitReset.

YES BABY ! =D It's 2021 ! Time to start sharing code a little bit more professionally and effectively and actually good usuable.

However in case you don't know how to use GIT and still want to have a look at the source code then I will post it here anyway,

we can also discuss further ideas for optimization on newsgroups, I am particularly interested in any high speed assembler assistence for x86 and/or x64 instruction set !

unit UnitTDataTLink;

{

TestProgram/Optimization Project for TData and TLink

version 0.02 created on 29 october 2021 by Skybuck Flying

TData contains an array of 32 bytes, in other words 256 bits.

TLink contains an array of 256 pointers.

The mission of this project is to optimize the IsEmpty and IsFull functions

to run as fast as possible on modern x86/x64 processors.

Bonus would be to optimize Empty and Full routines, however beware

the TLink structure uses special pointer values for empty and full.

The verification routines do not need to be optimized and should not

be optimized and are only there to verify any optimized code.

I will sure this code in github for easy downloading and uploading.

Would be cool to get some help with optimizing this code.

The original code used a mCount : byte; member to optimize the code

however this increases the TData structure to 33 bytes which could

be causing misalignment and consumes 3% more memory.

The TLink structure would also fit better if mCount is illiminated.

Both structures contained a mCount : byte variable and have been removed.

Now the same functionality to check for empty and full should be implemented

with as fast as code as possible without using any extra member fields

for the structures to keep them memory efficient like this.

Perhaps these link functions can be optimized with AVX-512 or so ?

}

{

version 0.03 created on 29 october 2021 by Skybuck Flying

Console program turned into multi-device-application project for

multi-platform support.

Project and Source files seperated into seperate folders to avoid

file clutter.

Only WIN32 platform currently implemented.

Other platforms possible thanks to multi-device-project/applicaation.

For some reason it was not possible to add platforms to a console project.

So I modified code so it can work with a multi-device-application, this

added a lot of platforms.

WIN64 platform support is desired !

The rest would be a bonus !

}

interface

uses

Classes; // for TStrings

type

TData = packed record

mData : packed array[0..31] of byte;

procedure Empty;

procedure Full;

function IsEmpty : boolean;

function IsFull : boolean;

// debug routines

function VerifyIsEmpty : boolean;

function VerifyIsFull : boolean;

function VerifyOne : boolean;

function VerifyRandomButOne : boolean;

procedure VerifyAll;

end;

TLink = packed record

mLink : packed array[0..255] of pointer;

procedure Empty;

procedure Full;

function IsEmpty : boolean;

function IsFull : boolean;

// debug routines

function VerifyIsEmpty : boolean;

function VerifyIsFull : boolean;

function VerifyOne : boolean;

function VerifyRandomButOne : boolean;

procedure VerifyAll;

end;

procedure MainTestProgram( ParaOutput : TStrings );

implementation

uses

SysUtils; // for IntToStr

const

ConstDataTestLoops = 1000000; // one million ! =D

ConstLinkTestLoops = 100000; // one hundred thousand ! =D

var

const_empty : pointer;

const_full : pointer;

var

vOutput : TStrings;

procedure OutputMessage( ParaString : string );

begin

vOutput.Add( ParaString );

end;

// only works on win32 platform, need win64 platform versions !

{$ifdef WIN32}

// unlocked versions (multi-threading-unsafe)

procedure BitReset( ParaMemory : pointer; ParaBitIndex : integer );

asm

btr [eax], edx

end;

procedure BitSet( ParaMemory : pointer; ParaBitIndex : integer );

asm

bts [eax], edx

end;

// locked versions (multi-threading-safe)

procedure LockedBitReset( ParaMemory : pointer; ParaBitIndex : integer );

asm

LOCK btr [eax], edx

end;

procedure LockedBitSet( ParaMemory : pointer; ParaBitIndex : integer );

asm

LOCK bts [eax], edx

end;

{$endif}

{$ifdef WIN64}

// unlocked versions (multi-threading-unsafe)

// optimize me ?

procedure BitReset( ParaMemory : pointer; ParaBitIndex : integer );

begin

// to be implemented

ERROR NOT IMPLEMENTED

end;

procedure BitSet( ParaMemory : pointer; ParaBitIndex : integer );

begin

// to be implemented

ERROR NOT IMPLEMENTED

end;

// not really required for this test program but would be nice to have a 64 bit version of this as well.

procedure LockedBitReset( ParaMemory : pointer; ParaBitIndex : integer );

begin

end;

procedure LockedBitSet( ParaMemory : pointer; ParaBitIndex : integer );

begin

end;

{$endif}

{$ifdef WIN32}

procedure TData.Empty;

type

TUInt32Array = packed array[0..7] of UInt32;

begin

TUInt32Array(mData)[0] := 0;

TUInt32Array(mData)[1] := 0;

TUInt32Array(mData)[2] := 0;

TUInt32Array(mData)[3] := 0;

TUInt32Array(mData)[4] := 0;

TUInt32Array(mData)[5] := 0;

TUInt32Array(mData)[6] := 0;

TUInt32Array(mData)[7] := 0;

end;

{$endif}

{$ifdef WIN64}

procedure TData.Empty;

type

TUInt64Array = packed array[0..3] of UInt64;

begin

TUInt64Array(mData)[0] := 0;

TUInt64Array(mData)[1] := 0;

TUInt64Array(mData)[2] := 0;

TUInt64Array(mData)[3] := 0;

end;

{$endif}

{$ifdef WIN32}

procedure TData.Full;

type

TUInt32Array = packed array[0..7] of longword;

begin

// 11223344

TUInt32Array(mData)[0] := $FFFFFFFF;

TUInt32Array(mData)[1] := $FFFFFFFF;

TUInt32Array(mData)[2] := $FFFFFFFF;

TUInt32Array(mData)[3] := $FFFFFFFF;

TUInt32Array(mData)[4] := $FFFFFFFF;

TUInt32Array(mData)[5] := $FFFFFFFF;

TUInt32Array(mData)[6] := $FFFFFFFF;

TUInt32Array(mData)[7] := $FFFFFFFF;

end;

{$endif}

{$ifdef WIN64}

procedure TData.Full;

type

TUInt64Array = packed array[0..3] of UInt64;

begin

// 1122334455667788

TUInt64Array(mData)[0] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[1] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[2] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[3] := $FFFFFFFFFFFFFFFF;

end;

{$endif}

{$ifdef WIN32}

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : longword;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] or

TUInt32Array(mData)[1] or

TUInt32Array(mData)[2] or

TUInt32Array(mData)[3] or

TUInt32Array(mData)[4] or

TUInt32Array(mData)[5] or

TUInt32Array(mData)[6] or

TUInt32Array(mData)[7];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN64}

function TData.IsEmpty : boolean;

type

TUInt64Array = packed array[0..3] of UInt64;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt64Array(mData)[0] or

TUInt64Array(mData)[1] or

TUInt64Array(mData)[2] or

TUInt64Array(mData)[3];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN32}

function TData.IsFull : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] and

TUInt32Array(mData)[1] and

TUInt32Array(mData)[2] and

TUInt32Array(mData)[3] and

TUInt32Array(mData)[4] and

TUInt32Array(mData)[5] and

TUInt32Array(mData)[6] and

TUInt32Array(mData)[7];

// 11223344

if vResult = $FFFFFFFF then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN64}

function TData.IsFull : boolean;

type

TUInt64Array = packed array[0..3] of UInt64;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt64Array(mData)[0] and

TUInt64Array(mData)[1] and

TUInt64Array(mData)[2] and

TUInt64Array(mData)[3];

// 1122334455667788

if vResult = $FFFFFFFFFFFFFFFF then

begin

result := True;

end;

end;

{$endif}

// TData verification routines

function TData.VerifyIsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 31 do

begin

if mData[vIndex] <> 0 then

begin

result := False;

end;

end;

end;

function TData.VerifyIsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 31 do

begin

if mData[vIndex] <> 255 then

begin

result := False;

end;

end;

end;

function TData.VerifyOne : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

Empty;

BitSet(@mData[vIndex div 8], vIndex mod 8);

if not

(

(

(not IsEmpty) and (not VerifyIsEmpty)

)

and

(

(not IsFull) and (not VerifyIsFull)

)

) then

begin

result := False;

end;

end;

end;

// heavy duty verification

function TData.VerifyRandomButOne : boolean;

var

vTestLoop : integer;

vTestIndex : integer;

vIndex : integer;

begin

result := True;

for vTestLoop := 1 to ConstDataTestLoops do

begin

Empty;

// missing one index ;)

for vIndex := 1 to 255 do

begin

vTestIndex := Random(256); // gives range 0 to 255

BitSet( @mData[vTestIndex div 8], vTestIndex mod 8 );

end;

if IsEmpty or VerifyIsEmpty or IsFull or VerifyIsFull then

begin

result := False;

OutputMessage('TData.VerifyRandomButOne: Error detected at TestLoop: ' + IntToStr(vTestLoop) );

break;

end;

end;

end;

procedure TData.VerifyAll;

var

vData : TData;

begin

vData.Empty;

if vData.IsEmpty = vData.VerifyIsEmpty then

begin

OutputMessage('Verified: vData.IsEmpty');

end else

begin

OutputMessage('Verification Error: vData.IsEmpty');

end;

vData.Full;

if vData.IsFull = vData.VerifyIsFull then

begin

OutputMessage('Verified: vData.IsFull');

end else

begin

OutputMessage('Verification Error: vData.IsFull');

end;

if vData.VerifyOne then

begin

OutputMessage('Verified: vData one bit set and not IsEmpty and not IsFull');

end else

begin

OutputMessage('Verification Error: vData one bit set and not IsEmpty and not IsFull');

end;

if vData.VerifyRandomButOne then

begin

OutputMessage('Verified: vData.VerifyRandomButOne' );

end else

begin

OutputMessage('Verification Error: vData.VerifyRandomButOne' );

end;

end;

// TLink

procedure TLink.Empty;

var

vIndex : integer;

begin

for vIndex := 0 to 255 do

begin

mLink[vIndex] := const_empty;

end;

end;

procedure TLink.Full;

var

vIndex : integer;

begin

for vIndex := 0 to 255 do

begin

mLink[vIndex] := const_full;

end;

end;

function TLink.IsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_empty then

begin

result := False;

Break;

end;

end;

end;

function TLink.IsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_full then

begin

result := False;

Break;

end;

end;

end;

// TLink verification routines

function TLink.VerifyIsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_empty then

begin

result := False;

end;

end;

end;

function TLink.VerifyIsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_full then

begin

result := False;

end;

end;

end;

function TLink.VerifyOne : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

Empty;

if SizeOf(Pointer) = 4 then

begin

mLink[vIndex] := pointer($11223344);

end else

if SizeOf(Pointer) = 8 then

begin

mLink[vIndex] := pointer($1122334455667788);

end;

if not

(

(

(not IsEmpty) and (not VerifyIsEmpty)

)

and

(

(not IsFull) and (not VerifyIsFull)

)

) then

begin

result := False;

end;

end;

end;

// heavy duty verification

function TLink.VerifyRandomButOne : boolean;

var

vTestLoop : integer;

vTestIndex : integer;

vIndex : integer;

begin

result := True;

for vTestLoop := 1 to ConstLinkTestLoops do

begin

Empty;

// missing one index ;)

for vIndex := 1 to 255 do

begin

vTestIndex := Random(256); // gives range 0 to 255

if SizeOf(Pointer) = 4 then

begin

mLink[vTestIndex] := pointer($11223344);

end else

if SizeOf(Pointer) = 8 then

begin

mLink[vTestIndex] := pointer($1122334455667788);

end;

end;

if IsEmpty or VerifyIsEmpty or IsFull or VerifyIsFull then

begin

result := False;

OutputMessage('TLink.VerifyRandomButOne: Error detected at TestLoop: ' + IntToStr(vTestLoop) );

break;

end;

end;

end;

procedure TLink.VerifyAll;

var

vLink : TLink;

begin

vLink.Empty;

if vLink.IsEmpty = vLink.VerifyIsEmpty then

begin

OutputMessage('Verified: vLink.IsEmpty');

end else

begin

OutputMessage('Verification Error: vLink.IsEmpty');

end;

vLink.Full;

if vLink.IsFull = vLink.VerifyIsFull then

begin

OutputMessage('Verified: vLink.IsFull');

end else

begin

OutputMessage('Verification Error: vLink.IsFull');

end;

if vLink.VerifyOne then

begin

OutputMessage('Verified: vLink one link set and not IsEmpty and not IsFull');

end else

begin

OutputMessage('Verification Error: vLink one link set and not IsEmpty and not IsFull');

end;

if vLink.VerifyRandomButOne then

begin

OutputMessage('Verified: vLink.VerifyRandomButOne' );

end else

begin

OutputMessage('Verification Error: vLink.VerifyRandomButOne' );

end;

end;

procedure MainTestProgram( ParaOutput : TStrings );

var

vData : TData;

vLink : TLink;

begin

vOutput := ParaOutput;

OutputMessage('program started');

OutputMessage('vData.VerifyAll');

vData.VerifyAll;

OutputMessage('');

OutputMessage('vLink.VerifyAll');

vLink.VerifyAll;

OutputMessage('');

OutputMessage('program finished');

end;

initialization

// should be place in unit initialization section but for now here is ok

// or initialize to low unused pointer value like 1 and 2.

// first 64k value range should be safe for windows virtual memory.

const_empty := @const_empty;

const_full := @const_full;

end.

Put memo and button on a multi-device-application form and add this code:

uses UnitTDataTLink;

procedure TForm1.Button1Click(Sender: TObject);

begin

try

Memo1.Lines.Add('Computing... please wait...');

Application.ProcessMessages;

UnitTDataTLink.MainTestProgram( Memo1.Lines );

Memo1.Lines.Add('Computations done.');

except

on E: Exception do

Memo1.Lines.Add(E.ClassName + ': ' + E.Message);

end;

end;

// Output should be:

{

program started

vData.VerifyAll

Verified: vData.IsEmpty

Verified: vData.IsFull

Verified: vData one bit set and not IsEmpty and not IsFull

Verified: vData.VerifyRandomButOne

vLink.VerifyAll

Verified: vLink.IsEmpty

Verified: vLink.IsFull

Verified: vLink one link set and not IsEmpty and not IsFull

Verified: vLink.VerifyRandomButOne

program finished

}

Bye for now,

Skybuck.

TData and TLink optimization project

https://github.com/SkybuckFlying/TDataTLinkOptimizationProject

The aim of the project is to:

Optimize the IsEmpty and IsFull routines for TData and TLink.

Optimize the Empty and Full routines for TData and TLink.

Implement WIN64 platform support by implementing BitSet and BitReset.

YES BABY ! =D It's 2021 ! Time to start sharing code a little bit more professionally and effectively and actually good usuable.

However in case you don't know how to use GIT and still want to have a look at the source code then I will post it here anyway,

we can also discuss further ideas for optimization on newsgroups, I am particularly interested in any high speed assembler assistence for x86 and/or x64 instruction set !

unit UnitTDataTLink;

{

TestProgram/Optimization Project for TData and TLink

version 0.02 created on 29 october 2021 by Skybuck Flying

TData contains an array of 32 bytes, in other words 256 bits.

TLink contains an array of 256 pointers.

The mission of this project is to optimize the IsEmpty and IsFull functions

to run as fast as possible on modern x86/x64 processors.

Bonus would be to optimize Empty and Full routines, however beware

the TLink structure uses special pointer values for empty and full.

The verification routines do not need to be optimized and should not

be optimized and are only there to verify any optimized code.

I will sure this code in github for easy downloading and uploading.

Would be cool to get some help with optimizing this code.

The original code used a mCount : byte; member to optimize the code

however this increases the TData structure to 33 bytes which could

be causing misalignment and consumes 3% more memory.

The TLink structure would also fit better if mCount is illiminated.

Both structures contained a mCount : byte variable and have been removed.

Now the same functionality to check for empty and full should be implemented

with as fast as code as possible without using any extra member fields

for the structures to keep them memory efficient like this.

Perhaps these link functions can be optimized with AVX-512 or so ?

}

{

version 0.03 created on 29 october 2021 by Skybuck Flying

Console program turned into multi-device-application project for

multi-platform support.

Project and Source files seperated into seperate folders to avoid

file clutter.

Only WIN32 platform currently implemented.

Other platforms possible thanks to multi-device-project/applicaation.

For some reason it was not possible to add platforms to a console project.

So I modified code so it can work with a multi-device-application, this

added a lot of platforms.

WIN64 platform support is desired !

The rest would be a bonus !

}

interface

uses

Classes; // for TStrings

type

TData = packed record

mData : packed array[0..31] of byte;

procedure Empty;

procedure Full;

function IsEmpty : boolean;

function IsFull : boolean;

// debug routines

function VerifyIsEmpty : boolean;

function VerifyIsFull : boolean;

function VerifyOne : boolean;

function VerifyRandomButOne : boolean;

procedure VerifyAll;

end;

TLink = packed record

mLink : packed array[0..255] of pointer;

procedure Empty;

procedure Full;

function IsEmpty : boolean;

function IsFull : boolean;

// debug routines

function VerifyIsEmpty : boolean;

function VerifyIsFull : boolean;

function VerifyOne : boolean;

function VerifyRandomButOne : boolean;

procedure VerifyAll;

end;

procedure MainTestProgram( ParaOutput : TStrings );

implementation

uses

SysUtils; // for IntToStr

const

ConstDataTestLoops = 1000000; // one million ! =D

ConstLinkTestLoops = 100000; // one hundred thousand ! =D

var

const_empty : pointer;

const_full : pointer;

var

vOutput : TStrings;

procedure OutputMessage( ParaString : string );

begin

vOutput.Add( ParaString );

end;

// only works on win32 platform, need win64 platform versions !

{$ifdef WIN32}

// unlocked versions (multi-threading-unsafe)

procedure BitReset( ParaMemory : pointer; ParaBitIndex : integer );

asm

btr [eax], edx

end;

procedure BitSet( ParaMemory : pointer; ParaBitIndex : integer );

asm

bts [eax], edx

end;

// locked versions (multi-threading-safe)

procedure LockedBitReset( ParaMemory : pointer; ParaBitIndex : integer );

asm

LOCK btr [eax], edx

end;

procedure LockedBitSet( ParaMemory : pointer; ParaBitIndex : integer );

asm

LOCK bts [eax], edx

end;

{$endif}

{$ifdef WIN64}

// unlocked versions (multi-threading-unsafe)

// optimize me ?

procedure BitReset( ParaMemory : pointer; ParaBitIndex : integer );

begin

// to be implemented

ERROR NOT IMPLEMENTED

end;

procedure BitSet( ParaMemory : pointer; ParaBitIndex : integer );

begin

// to be implemented

ERROR NOT IMPLEMENTED

end;

// not really required for this test program but would be nice to have a 64 bit version of this as well.

procedure LockedBitReset( ParaMemory : pointer; ParaBitIndex : integer );

begin

end;

procedure LockedBitSet( ParaMemory : pointer; ParaBitIndex : integer );

begin

end;

{$endif}

{$ifdef WIN32}

procedure TData.Empty;

type

TUInt32Array = packed array[0..7] of UInt32;

begin

TUInt32Array(mData)[0] := 0;

TUInt32Array(mData)[1] := 0;

TUInt32Array(mData)[2] := 0;

TUInt32Array(mData)[3] := 0;

TUInt32Array(mData)[4] := 0;

TUInt32Array(mData)[5] := 0;

TUInt32Array(mData)[6] := 0;

TUInt32Array(mData)[7] := 0;

end;

{$endif}

{$ifdef WIN64}

procedure TData.Empty;

type

TUInt64Array = packed array[0..3] of UInt64;

begin

TUInt64Array(mData)[0] := 0;

TUInt64Array(mData)[1] := 0;

TUInt64Array(mData)[2] := 0;

TUInt64Array(mData)[3] := 0;

end;

{$endif}

{$ifdef WIN32}

procedure TData.Full;

type

TUInt32Array = packed array[0..7] of longword;

begin

// 11223344

TUInt32Array(mData)[0] := $FFFFFFFF;

TUInt32Array(mData)[1] := $FFFFFFFF;

TUInt32Array(mData)[2] := $FFFFFFFF;

TUInt32Array(mData)[3] := $FFFFFFFF;

TUInt32Array(mData)[4] := $FFFFFFFF;

TUInt32Array(mData)[5] := $FFFFFFFF;

TUInt32Array(mData)[6] := $FFFFFFFF;

TUInt32Array(mData)[7] := $FFFFFFFF;

end;

{$endif}

{$ifdef WIN64}

procedure TData.Full;

type

TUInt64Array = packed array[0..3] of UInt64;

begin

// 1122334455667788

TUInt64Array(mData)[0] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[1] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[2] := $FFFFFFFFFFFFFFFF;

TUInt64Array(mData)[3] := $FFFFFFFFFFFFFFFF;

end;

{$endif}

{$ifdef WIN32}

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : longword;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] or

TUInt32Array(mData)[1] or

TUInt32Array(mData)[2] or

TUInt32Array(mData)[3] or

TUInt32Array(mData)[4] or

TUInt32Array(mData)[5] or

TUInt32Array(mData)[6] or

TUInt32Array(mData)[7];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN64}

function TData.IsEmpty : boolean;

type

TUInt64Array = packed array[0..3] of UInt64;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt64Array(mData)[0] or

TUInt64Array(mData)[1] or

TUInt64Array(mData)[2] or

TUInt64Array(mData)[3];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN32}

function TData.IsFull : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] and

TUInt32Array(mData)[1] and

TUInt32Array(mData)[2] and

TUInt32Array(mData)[3] and

TUInt32Array(mData)[4] and

TUInt32Array(mData)[5] and

TUInt32Array(mData)[6] and

TUInt32Array(mData)[7];

// 11223344

if vResult = $FFFFFFFF then

begin

result := True;

end;

end;

{$endif}

{$ifdef WIN64}

function TData.IsFull : boolean;

type

TUInt64Array = packed array[0..3] of UInt64;

var

vResult : UInt64;

begin

result := False;

vResult :=

TUInt64Array(mData)[0] and

TUInt64Array(mData)[1] and

TUInt64Array(mData)[2] and

TUInt64Array(mData)[3];

// 1122334455667788

if vResult = $FFFFFFFFFFFFFFFF then

begin

result := True;

end;

end;

{$endif}

// TData verification routines

function TData.VerifyIsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 31 do

begin

if mData[vIndex] <> 0 then

begin

result := False;

end;

end;

end;

function TData.VerifyIsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 31 do

begin

if mData[vIndex] <> 255 then

begin

result := False;

end;

end;

end;

function TData.VerifyOne : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

Empty;

BitSet(@mData[vIndex div 8], vIndex mod 8);

if not

(

(

(not IsEmpty) and (not VerifyIsEmpty)

)

and

(

(not IsFull) and (not VerifyIsFull)

)

) then

begin

result := False;

end;

end;

end;

// heavy duty verification

function TData.VerifyRandomButOne : boolean;

var

vTestLoop : integer;

vTestIndex : integer;

vIndex : integer;

begin

result := True;

for vTestLoop := 1 to ConstDataTestLoops do

begin

Empty;

// missing one index ;)

for vIndex := 1 to 255 do

begin

vTestIndex := Random(256); // gives range 0 to 255

BitSet( @mData[vTestIndex div 8], vTestIndex mod 8 );

end;

if IsEmpty or VerifyIsEmpty or IsFull or VerifyIsFull then

begin

result := False;

OutputMessage('TData.VerifyRandomButOne: Error detected at TestLoop: ' + IntToStr(vTestLoop) );

break;

end;

end;

end;

procedure TData.VerifyAll;

var

vData : TData;

begin

vData.Empty;

if vData.IsEmpty = vData.VerifyIsEmpty then

begin

OutputMessage('Verified: vData.IsEmpty');

end else

begin

OutputMessage('Verification Error: vData.IsEmpty');

end;

vData.Full;

if vData.IsFull = vData.VerifyIsFull then

begin

OutputMessage('Verified: vData.IsFull');

end else

begin

OutputMessage('Verification Error: vData.IsFull');

end;

if vData.VerifyOne then

begin

OutputMessage('Verified: vData one bit set and not IsEmpty and not IsFull');

end else

begin

OutputMessage('Verification Error: vData one bit set and not IsEmpty and not IsFull');

end;

if vData.VerifyRandomButOne then

begin

OutputMessage('Verified: vData.VerifyRandomButOne' );

end else

begin

OutputMessage('Verification Error: vData.VerifyRandomButOne' );

end;

end;

// TLink

procedure TLink.Empty;

var

vIndex : integer;

begin

for vIndex := 0 to 255 do

begin

mLink[vIndex] := const_empty;

end;

end;

procedure TLink.Full;

var

vIndex : integer;

begin

for vIndex := 0 to 255 do

begin

mLink[vIndex] := const_full;

end;

end;

function TLink.IsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_empty then

begin

result := False;

Break;

end;

end;

end;

function TLink.IsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_full then

begin

result := False;

Break;

end;

end;

end;

// TLink verification routines

function TLink.VerifyIsEmpty : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_empty then

begin

result := False;

end;

end;

end;

function TLink.VerifyIsFull : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

if mLink[vIndex] <> const_full then

begin

result := False;

end;

end;

end;

function TLink.VerifyOne : boolean;

var

vIndex : integer;

begin

result := True;

for vIndex := 0 to 255 do

begin

Empty;

if SizeOf(Pointer) = 4 then

begin

mLink[vIndex] := pointer($11223344);

end else

if SizeOf(Pointer) = 8 then

begin

mLink[vIndex] := pointer($1122334455667788);

end;

if not

(

(

(not IsEmpty) and (not VerifyIsEmpty)

)

and

(

(not IsFull) and (not VerifyIsFull)

)

) then

begin

result := False;

end;

end;

end;

// heavy duty verification

function TLink.VerifyRandomButOne : boolean;

var

vTestLoop : integer;

vTestIndex : integer;

vIndex : integer;

begin

result := True;

for vTestLoop := 1 to ConstLinkTestLoops do

begin

Empty;

// missing one index ;)

for vIndex := 1 to 255 do

begin

vTestIndex := Random(256); // gives range 0 to 255

if SizeOf(Pointer) = 4 then

begin

mLink[vTestIndex] := pointer($11223344);

end else

if SizeOf(Pointer) = 8 then

begin

mLink[vTestIndex] := pointer($1122334455667788);

end;

end;

if IsEmpty or VerifyIsEmpty or IsFull or VerifyIsFull then

begin

result := False;

OutputMessage('TLink.VerifyRandomButOne: Error detected at TestLoop: ' + IntToStr(vTestLoop) );

break;

end;

end;

end;

procedure TLink.VerifyAll;

var

vLink : TLink;

begin

vLink.Empty;

if vLink.IsEmpty = vLink.VerifyIsEmpty then

begin

OutputMessage('Verified: vLink.IsEmpty');

end else

begin

OutputMessage('Verification Error: vLink.IsEmpty');

end;

vLink.Full;

if vLink.IsFull = vLink.VerifyIsFull then

begin

OutputMessage('Verified: vLink.IsFull');

end else

begin

OutputMessage('Verification Error: vLink.IsFull');

end;

if vLink.VerifyOne then

begin

OutputMessage('Verified: vLink one link set and not IsEmpty and not IsFull');

end else

begin

OutputMessage('Verification Error: vLink one link set and not IsEmpty and not IsFull');

end;

if vLink.VerifyRandomButOne then

begin

OutputMessage('Verified: vLink.VerifyRandomButOne' );

end else

begin

OutputMessage('Verification Error: vLink.VerifyRandomButOne' );

end;

end;

procedure MainTestProgram( ParaOutput : TStrings );

var

vData : TData;

vLink : TLink;

begin

vOutput := ParaOutput;

OutputMessage('program started');

OutputMessage('vData.VerifyAll');

vData.VerifyAll;

OutputMessage('');

OutputMessage('vLink.VerifyAll');

vLink.VerifyAll;

OutputMessage('');

OutputMessage('program finished');

end;

initialization

// should be place in unit initialization section but for now here is ok

// or initialize to low unused pointer value like 1 and 2.

// first 64k value range should be safe for windows virtual memory.

const_empty := @const_empty;

const_full := @const_full;

end.

Put memo and button on a multi-device-application form and add this code:

uses UnitTDataTLink;

procedure TForm1.Button1Click(Sender: TObject);

begin

try

Memo1.Lines.Add('Computing... please wait...');

Application.ProcessMessages;

UnitTDataTLink.MainTestProgram( Memo1.Lines );

Memo1.Lines.Add('Computations done.');

except

on E: Exception do

Memo1.Lines.Add(E.ClassName + ': ' + E.Message);

end;

end;

// Output should be:

{

program started

vData.VerifyAll

Verified: vData.IsEmpty

Verified: vData.IsFull

Verified: vData one bit set and not IsEmpty and not IsFull

Verified: vData.VerifyRandomButOne

vLink.VerifyAll

Verified: vLink.IsEmpty

Verified: vLink.IsFull

Verified: vLink one link set and not IsEmpty and not IsFull

Verified: vLink.VerifyRandomButOne

program finished

}

Bye for now,

Skybuck.

Nov 23, 2021, 11:23:56 AM11/23/21

to

Conceptually this is very smart to invert the logic !

Your idea (Peter 'Shaggy' Haywood) has given me a new idea for the code:

Instead of wasting cpu cycles/times checking everything instead only check what is needed to come to a certain "inverted conclusions".

It is indeed not necessary to check all bits to see if they are all clear, if one of them is not clear then obviously the data is not empty, and the function can exit earlier.

It is indeed not necessary to check all bits to see if they are all set, if one of them is not set then obviously the data is not full, and the function can exit earlier.

HAHA ! Pretty smart and brilliant ! ;) =D

Or is it ? hmmmm

Maybe not.... it depends...

Let's analyze code below:

{$ifdef WIN32}

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : longword;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] or

TUInt32Array(mData)[1] or

TUInt32Array(mData)[2] or

TUInt32Array(mData)[3] or

TUInt32Array(mData)[4] or

TUInt32Array(mData)[5] or

TUInt32Array(mData)[6] or

TUInt32Array(mData)[7];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

Theoretically it needs to pull in 8x 32 bit integers, this could strain the memory bus somewhat, slow down the cpu, while it's waiting for data.... and it may pollute the l1 data cache unnecessarily.

On the other hand it does only one branch comparison.

Now the alternative would be to do 8 branch comparisons. Perhaps there is a golden ground in the middle, maybe only do 2 or 4 comparisons that is a possibility as well.

So different combinations could be tried to see which one performs best.

Here is where a little bit of artificial intelligence could be handy.

The application could measure all versions during active/real world usage with statisticall data tracking... for the first few minutes or so...

Then once another variations and statistics collected it can switch to non-statistical methods and pick the one which it believes is the fastest.

Anyway let's create a worst case counter example and analyze that, untested code:

To cool thing is this also gets rid of the extra variable.

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

begin

result := False;

// if not empty then exit.

if TUInt32Array(mData)[0]) <> 0 then exit;

if TUInt32Array(mData)[1]) <> 0 then exit;

if TUInt32Array(mData)[2]) <> 0 then exit;

if TUInt32Array(mData)[3]) <> 0 then exit;

if TUInt32Array(mData)[4]) <> 0 then exit;

if TUInt32Array(mData)[5]) <> 0 then exit;

if TUInt32Array(mData)[6]) <> 0 then exit;

if TUInt32Array(mData)[7]) <> 0 then exit;

result := True; // if execution reaches here then return true.

end;

{$endif}

Pretty cool,

These are still 8 data accesses, no or instructions, 8 comparisons, possibly 8 jumps

On average this will probably run at 4,4,4

It will take up some more branch predicators, less data cache

Hard to say which one will run faster ;) for now my bet would be on the first one, because or instructions are 1 cycle and produces can be 15 or something.

However data access can be in the range of 400 nanoseconds, maybe 400 cycles, that could severely impact performance.

And this this newer version might run faster, depends on how could the L1 data cache lining is, if it's good maybe 10 cycles per data access... for a average total of 40.

Hmmm... I may have to benchmark this sometime ! ;)

And the isFull version:

// untested code

{$ifdef WIN32}

function TData.IsFull : boolean;

type

TUInt32Array = packed array[0..7] of longword;

begin

result := False;

// if not full exit

if TUInt32Array(mData)[0] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[1] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[2] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[3] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[4] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[5] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[6] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[7] <> $FFFFFFFF then exit;

result := True;

end;

{$endif}

Ok interesting,

Now we truely have something to benchmark ! ;) =D

I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

Bye for now,

And thanks for the feedback and ideas !

Skybuck Flying ! =D

Your idea (Peter 'Shaggy' Haywood) has given me a new idea for the code:

Instead of wasting cpu cycles/times checking everything instead only check what is needed to come to a certain "inverted conclusions".

It is indeed not necessary to check all bits to see if they are all clear, if one of them is not clear then obviously the data is not empty, and the function can exit earlier.

It is indeed not necessary to check all bits to see if they are all set, if one of them is not set then obviously the data is not full, and the function can exit earlier.

HAHA ! Pretty smart and brilliant ! ;) =D

Or is it ? hmmmm

Maybe not.... it depends...

Let's analyze code below:

{$ifdef WIN32}

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

var

vResult : longword;

begin

result := False;

vResult :=

TUInt32Array(mData)[0] or

TUInt32Array(mData)[1] or

TUInt32Array(mData)[2] or

TUInt32Array(mData)[3] or

TUInt32Array(mData)[4] or

TUInt32Array(mData)[5] or

TUInt32Array(mData)[6] or

TUInt32Array(mData)[7];

if vResult = 0 then

begin

result := True;

end;

end;

{$endif}

On the other hand it does only one branch comparison.

Now the alternative would be to do 8 branch comparisons. Perhaps there is a golden ground in the middle, maybe only do 2 or 4 comparisons that is a possibility as well.

So different combinations could be tried to see which one performs best.

Here is where a little bit of artificial intelligence could be handy.

The application could measure all versions during active/real world usage with statisticall data tracking... for the first few minutes or so...

Then once another variations and statistics collected it can switch to non-statistical methods and pick the one which it believes is the fastest.

Anyway let's create a worst case counter example and analyze that, untested code:

To cool thing is this also gets rid of the extra variable.

function TData.IsEmpty : boolean;

type

TUInt32Array = packed array[0..7] of longword;

result := False;

// if not empty then exit.

if TUInt32Array(mData)[0]) <> 0 then exit;

if TUInt32Array(mData)[1]) <> 0 then exit;

if TUInt32Array(mData)[2]) <> 0 then exit;

if TUInt32Array(mData)[3]) <> 0 then exit;

if TUInt32Array(mData)[4]) <> 0 then exit;

if TUInt32Array(mData)[5]) <> 0 then exit;

if TUInt32Array(mData)[6]) <> 0 then exit;

if TUInt32Array(mData)[7]) <> 0 then exit;

result := True; // if execution reaches here then return true.

end;

{$endif}

Pretty cool,

These are still 8 data accesses, no or instructions, 8 comparisons, possibly 8 jumps

On average this will probably run at 4,4,4

It will take up some more branch predicators, less data cache

Hard to say which one will run faster ;) for now my bet would be on the first one, because or instructions are 1 cycle and produces can be 15 or something.

However data access can be in the range of 400 nanoseconds, maybe 400 cycles, that could severely impact performance.

And this this newer version might run faster, depends on how could the L1 data cache lining is, if it's good maybe 10 cycles per data access... for a average total of 40.

Hmmm... I may have to benchmark this sometime ! ;)

And the isFull version:

// untested code

{$ifdef WIN32}

function TData.IsFull : boolean;

type

TUInt32Array = packed array[0..7] of longword;

result := False;

// if not full exit

if TUInt32Array(mData)[0] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[1] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[2] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[3] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[4] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[5] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[6] <> $FFFFFFFF then exit;

if TUInt32Array(mData)[7] <> $FFFFFFFF then exit;

result := True;

end;

{$endif}

Ok interesting,

Now we truely have something to benchmark ! ;) =D

I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

Bye for now,

And thanks for the feedback and ideas !

Skybuck Flying ! =D

Nov 23, 2021, 1:58:03 PM11/23/21

to

On 23/11/2021 17:23, skybuck2000 wrote:

,,,,

much faster and shorter than IF THEN jump for an all zero check.

and not a single branch is required with setcc/cmov after the last OR.

similar story for check if all set: AND all and INC last.

if the INC doesn't set ZF then it wasn't an all bits set value.

__

wolfgang

,,,,

> Ok interesting,

>

> Now we truely have something to benchmark ! ;) =D

>

> I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

>

don't waste your time because its more than obvious that OR chains are
>

> Now we truely have something to benchmark ! ;) =D

>

> I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

>

much faster and shorter than IF THEN jump for an all zero check.

and not a single branch is required with setcc/cmov after the last OR.

similar story for check if all set: AND all and INC last.

if the INC doesn't set ZF then it wasn't an all bits set value.

__

wolfgang

Nov 24, 2021, 3:37:38 PM11/24/21

to

And the next 4 integers are not in L1 Data cache and on second 4k page.

This could at least incur a 400 cycles or so ??

Me no expert in that but that little I think to know.

You seem to be assuming all data is in L1 data cache, and that could be your flaw.

Though on average a high percentage would indeed be in L1 data cache.

Though I would not be surprise if somehow you are wrong ! =D

Perhaps because of other processing occuring, there is always something to trip things up ! =D

And many different computers, but ok, todays systems... can be assumed ! ;)

Though this kind of software should be able to run on 8486, pentium, athlons, skylakes, threadrippers and any future computers/processors.

L1 caches seem to be shrinking in size on many-core CPUs.

If that shrinkage continues, code like this may be affected.

Bye,

Skybuck

Nov 25, 2021, 6:16:57 AM11/25/21

to

On 24/11/2021 21:37, skybuck2000 wrote:

> On Tuesday, November 23, 2021 at 7:58:03 PM UTC+1, wolfgang kern wrote:

>> On 23/11/2021 17:23, skybuck2000 wrote:

>> ,,,,

>>> Ok interesting,

>>>

>>> Now we truely have something to benchmark ! ;) =D

>>>

>>> I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

>>>

>> don't waste your time because its more than obvious that OR chains are

>> much faster and shorter than IF THEN jump for an all zero check.

>> and not a single branch is required with setcc/cmov after the last OR.

>>

>> similar story for check if all set: AND all and INC last.

>> if the INC doesn't set ZF then it wasn't an all bits set value.

>

> What is the first 4 integers are in the L1 Data cache or on a page boundary near the end.

I'd say this were only possible on a stupid memory layout
> On Tuesday, November 23, 2021 at 7:58:03 PM UTC+1, wolfgang kern wrote:

>> On 23/11/2021 17:23, skybuck2000 wrote:

>> ,,,,

>>> Ok interesting,

>>>

>>> Now we truely have something to benchmark ! ;) =D

>>>

>>> I am kinda busy so this will have to wait time a further time, but other people are welcome to benchmark this themselfes ! ;) :)

>>>

>> don't waste your time because its more than obvious that OR chains are

>> much faster and shorter than IF THEN jump for an all zero check.

>> and not a single branch is required with setcc/cmov after the last OR.

>>

>> similar story for check if all set: AND all and INC last.

>> if the INC doesn't set ZF then it wasn't an all bits set value.

>

> What is the first 4 integers are in the L1 Data cache or on a page boundary near the end.

> And the next 4 integers are not in L1 Data cache and on second 4k page.

> This could at least incur a 400 cycles or so ??

> Me no expert in that but that little I think to know.

>

> You seem to be assuming all data is in L1 data cache, and that could be your flaw.

>

> Though on average a high percentage would indeed be in L1 data cache.

>

> Though I would not be surprise if somehow you are wrong ! =D

>

> Perhaps because of other processing occuring, there is always something to trip things up ! =D

>

> And many different computers, but ok, todays systems... can be assumed ! ;)

>

> Though this kind of software should be able to run on 8486, pentium, athlons, skylakes, threadrippers and any future computers/processors.

>

> L1 caches seem to be shrinking in size on many-core CPUs.

>

> If that shrinkage continues, code like this may be affected.

IF THEN structs will always lose against OR/AND-chains in terms of size

and speed regardless of where the data are.

and again any split memory compound sound pretty careless programming.

__

wolfgang

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu