D2007: Typecasting 32 bit integer/longword to 32 bit single floating point not possible ?!?

Skybuck Flying

Jan 4, 2010, 7:59:27 PM

Hello,

Currently I want to experiment with moving memory values/integers/longwords
onto the gpu.

Since the gpu works with floating points only, I was wondering if it might
be possible to simply "stuff" the integer/longwords into the gpu floating
point variables/memory cells and see what happens...

I was hoping for "bitpattern perfect" copies.

I now realize this is pretty much a pipe dream since the floating point format
works totally differently from the integer format... so "stuffing integers" via
memory copy into "floating points" probably makes no sense at all... the
floating point value would end up being something weird... and that could
mean the gpu is not able to work with it correctly ?!?

Unless maybe the gpu/cg shader typecasts to int might work ? But I doubt
that's gonna work because that's probably something special... it's probably
the "gpu pretending to support integers ?" while in reality it's still using
floats ?

However it was a worthy Delphi experiment since it shows "converting from
longword/integer to single and back again via assignments/rounds" is not
flawless... and produces imprecision.
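
To make the difference between an "assignment" and a "typecast" concrete, here is a minimal sketch in Python rather than Delphi. The helper names int_bits_as_float and float_as_int_bits are made up for illustration, and the sketch assumes little-endian IEEE 754 singles. It only shows the idea, it is not what Delphi does internally:

```python
import struct

def int_bits_as_float(value):
    """Reinterpret the 32 bits of an unsigned integer as an IEEE 754 single."""
    return struct.unpack('<f', struct.pack('<I', value))[0]

def float_as_int_bits(value):
    """Return the 32-bit pattern of a value stored as an IEEE 754 single."""
    return struct.unpack('<I', struct.pack('<f', value))[0]

# Value conversion (what an assignment does): 1 becomes 1.0,
# which is stored with the bit pattern 0x3F800000.
converted_bits = float_as_int_bits(float(1))

# Bit reinterpretation (what a typecast would do): the pattern 0x00000001
# is kept, and read as a single it is a tiny denormal, not 1.0.
reinterpreted = int_bits_as_float(1)

print(hex(converted_bits))               # 0x3f800000
print(reinterpreted)                     # a very small positive number
print(float_as_int_bits(reinterpreted))  # 1, the original bits survive
```

So a conversion keeps the value but changes the bits, while a reinterpretation keeps the bits but changes the value.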

I am not completely happy with D2007 not supporting these types of wacky
typecasts... if I want to typecast then what's the fricking problem ?! I
feel slightly frustrated about that... but at the same time... typecasting
it in this case would have made little sense but still...

Should Delphi have support for wacky typecasts like this ? It remains open
for debate methinks ?! ;) :)

(I will start a new thread asking how to convert between integers and singles
without loss of precision, if that is possible at all. This thread is just to
prove that "assignments" don't cut it and lead to problems ! ;) :))

// *** Begin of Program ***

program TestProgram;

{$APPTYPE CONSOLE}

{

Test typecasting a 32 bit longword, or 32 bit integer to a 32 bit single
(floating point)

In theory this should be possible since they both use 32 bits ?!

version 0.01 created on 5 january 2010 by Skybuck Flying

Short story:

Delphi 2007 does not allow it,
Using assignments produces different bit patterns and precision loss, which
is bad !

Long story:

Delphi 2007 cannot typecast integer/longword to single ?!?

I want to use a typecast to "copy" the integer/longword bitpattern towards
the floating point variable to see what happens and for a bitpattern perfect
copy ?!?

I assume an assignment will not give a bit perfect copy ?
vSingle := vLongword;
vSingle := vInteger;

Let's test this assumption with a quick loop that compares the bit
patterns/memory ;)

My assumption was correct, assignments do not give a bit perfect copy.

However, a different bit pattern does not by itself mean that the conversion
loses precision; that needs to be tested separately... and that has been done
as well...

And indeed conversion also produces imprecision/errors, which proves that
typecasting is necessary in case single floating point values are to be filled
with bit perfect patterns ! Otherwise a memcopy/move is necessary ! ;)

Example of program output:

*** begin of program output ***:

program started

CompareBitPatterns... aborted.
Longword to single assignment difference detected for value: 1
Single: 1.00000000000000E+0000
Longword: 1

CheckForConversionProblems... aborted.
Longword conversion error detected at value: 16777217
vSingle: 1.67772160000000E+0007
vLongword: 16777217
vNewLongword: 16777216

program finished

*** end of program output ***

}

uses
  SysUtils;

procedure CompareBitPatterns;
var
  vValue : int64;
  vCount : int64;

  vLongword : longword;
  vInteger : integer;

  vSingle : single;
begin
  write('CompareBitPatterns... ');

  vValue := 0;

  // 4 * 1024 * 1024 * 1024 = 2^32 values to test
  vCount := 4;
  vCount := vCount * 1024;
  vCount := vCount * 1024;
  vCount := vCount * 1024;

  while vValue < vCount do
  begin
    vLongword := Longword(vValue);
    vInteger := Integer(vValue);

    vSingle := vLongword;

    // ***
    // BOOM: value 1 already shows a difference !
    // using two watches in "memory dump" mode shows indeed different bit
    // patterns !
    // that doesn't necessarily mean that precision is lost but it becomes
    // highly doubtful.
    // let's further test if precision is lost by converting back and forth
    // and see if any precision is lost... in a next test procedure.
    // ***
    if not CompareMem( @vLongword, @vSingle, 4 ) then
    begin
      writeln('aborted.');
      writeln('Longword to single assignment difference detected for value: ',
        vValue );
      writeln('Single: ', vSingle );
      writeln('Longword: ', vLongword );
      writeln;
      exit;
    end;

    vSingle := vInteger;

    if not CompareMem( @vInteger, @vSingle, 4 ) then
    begin
      writeln('aborted.');
      writeln('Integer to single assignment difference detected for value: ',
        vValue );
      writeln('Single: ', vSingle );
      writeln('Integer: ', vInteger );
      writeln;
      exit;
    end;

    vValue := vValue + 1;
  end;
  writeln('done.');
end;

procedure CheckForConversionProblems;
var
  vValue : int64;
  vCount : int64;

  vLongword : longword;
  vNewLongword : longword;

  vInteger : integer;
  vNewInteger : integer;

  vSingle : single;
begin
  write('CheckForConversionProblems... ');

  vValue := 0;

  // 4 * 1024 * 1024 * 1024 = 2^32 values to test
  vCount := 4;
  vCount := vCount * 1024;
  vCount := vCount * 1024;
  vCount := vCount * 1024;

  while vValue < vCount do
  begin
    vLongword := Longword(vValue);
    vInteger := Integer(vValue);

    // *** longword value: 16777217 shows the first sign of trouble ! ;) ***

    // convert longword to single
    vSingle := vLongword;

    // convert single back to longword
    // must round it apparently... this will probably lead to further
    // problems as well..
    vNewLongword := Round( int(vSingle) );

    // compare the two longwords to detect any loss of precision/conversion
    // errors
    if vNewLongword <> vLongword then
    begin
      writeln('aborted.');
      writeln('Longword conversion error detected at value: ', vValue );
      writeln('vSingle: ', vSingle );
      writeln('vLongword: ', vLongword );
      writeln('vNewLongword: ', vNewLongword );
      writeln;
      exit;
    end;

    // convert integer to single
    vSingle := vInteger;

    // convert single back to integer
    vNewInteger := Round( vSingle );

    // compare the two integers to detect any loss of precision/conversion
    // errors
    if vNewInteger <> vInteger then
    begin
      writeln('aborted.');
      writeln('Integer conversion error detected at value: ', vValue );
      writeln('vSingle: ', vSingle );
      writeln('vInteger: ', vInteger );
      writeln('vNewInteger: ', vNewInteger );
      writeln;
      exit;
    end;

    vValue := vValue + 1;
  end;
  writeln('done.');
end;

procedure Main;
var
  vLongword : longword;
  vInteger : integer;

  vSingle : single;
begin
  writeln('program started');
  writeln;

  CompareBitPatterns;

  CheckForConversionProblems;

  {
  // test type-casting longword to single
  vLongword := 1234567;
  vSingle := single(vLongword); // typecast not possible ?! :(
  writeln( vSingle );

  // test type-casting integer to single
  vInteger := -1234567;
  vSingle := single(vInteger); // typecast not possible ?! :(
  writeln( vSingle );
  }

  writeln('program finished');
end;

begin
  try
    Main;
  except
    on E:Exception do
      Writeln(E.Classname, ': ', E.Message);
  end;
  ReadLn;
end.

// *** End of Program ***
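
The reason the second test stops at 16777217 can be sketched in a few lines of Python (again not Delphi; to_single and survives_round_trip are hypothetical helper names). A single has 24 significand bits (23 stored plus the hidden 1), so every integer up to 2^24 fits exactly and 2^24 + 1 is the first one that has to be rounded:

```python
import struct

def to_single(value):
    """Round a number to the nearest IEEE 754 single, as an assignment would."""
    return struct.unpack('<f', struct.pack('<f', float(value)))[0]

def survives_round_trip(n):
    """True if integer n comes back unchanged after a trip through a single."""
    return int(to_single(n)) == n

LIMIT = 2 ** 24  # 24 significand bits: 23 stored + 1 hidden

# Spot check just below the limit: all of these are exactly representable.
all_ok_below = all(survives_round_trip(n) for n in range(LIMIT - 1000, LIMIT + 1))

first_failure = LIMIT + 1
print(all_ok_below)                                  # True
print(first_failure, int(to_single(first_failure)))  # 16777217 16777216
```

This matches the vLongword/vNewLongword pair in the program output above.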

Bye,
Skybuck.

Skybuck Flying

Jan 4, 2010, 9:00:47 PM

When it comes to programming the GPU/7900 GTX/Shader Model 3.0/VP40/FP40
profiles it seems three number formats are interesting/available:

1. "Single floating point format" also known as IEEE 754 single precision,
which is probably the floating point format used by Delphi in the single type
as well as by CPU processors (?):

http://en.wikipedia.org/wiki/Single_precision_floating-point_format

2. GPU Profile FP40 has a special "native?" format:

"
The fixed data type corresponds to a native signed fixed-point integers with
the range [-2.0,+2.0), sometimes called fx12. This type provides 10
fractional bits of precision.
"

I am not sure if this format is available in VP40 profile as well for vertex
shaders ???

3. Undetermined:

It seems the CG manual which I have contains typos, the description in the FP40
profile seems focked up:

"
half
The half data type corresponds to a floating-point encoding with a sign bit,
10 mantissa bits, and 5 exponent bits (biased by 16), sometimes called
s10e5.
float

The half data type corresponds to a standard IEEE 754 single-precision
floating-point encoding with a sign bit, 23 mantissa bits, and 8 exponent
bits (biased by 128), sometimes called s10e5.
"

This is a pretty serious documentation mistake since now I cannot figure out
what was meant... tomorrow I will have to check if updated documentation is
available...

I am kinda curious what the "half" type is all about, it seems to be
different...

I also wonder which type is the fastest on the GPU: is it fx12 or is it half
?

Maybe half and fx12 are the same thing, since both mention "10" ? Hmm...

Ultimately the question is how to extract these separate bit fields from
the floating point format/field so that they can be mixed and
stuffed/extracted/shuffled whatever into other integer fields or floating
point fields for processing...

So as to construct an "extract bytes from floating point field" routine...

The newer profiles for G80 GPUs have some kind of functions/routines for
that, unfortunately I can't use those because I don't have a G80 but a
Pre-G80 ;)

I still want the same functionality on the Pre-G80... So hopefully that is
somehow possible ?! ;)

Tomorrow I will try to figure it out... by first trying to find some more
documentation, maybe some formulas will pop up... if nothing pops up then
I might have to give it a thought and try myself ;) :)

Fingers crossed for now...

Bye,
Skybuck.

Skybuck Flying

Jan 4, 2010, 9:58:13 PM

Another link describing what is meant by mantissa, for cross reference with
the CG manual which mentions it:

http://en.wikipedia.org/wiki/Significand

Delphi and CG both seem to have a function called: FREXP

This function extracts and returns the "Exponent" and the "Fraction/Integer"
(=Mantissa if I understand correctly)

Thus this function could be used for the "float" types in the CG shaders to
extract those bits.
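
As a rough illustration of what a frexp-style split gives, here is a sketch using Python's math.frexp and math.ldexp. I am assuming the Delphi and CG versions follow the same convention (mantissa between 0.5 and 1), which I have not verified. split_float and join_float are made-up helper names:

```python
import math

def split_float(x):
    """Split x into (sign_bit, mantissa, exponent) with abs(x) == mantissa * 2**exponent."""
    mantissa, exponent = math.frexp(abs(x))  # 0.5 <= mantissa < 1, or 0.0 for zero
    sign_bit = 1 if math.copysign(1.0, x) < 0 else 0
    return sign_bit, mantissa, exponent

def join_float(sign_bit, mantissa, exponent):
    """Inverse of split_float."""
    value = math.ldexp(mantissa, exponent)
    return -value if sign_bit else value

parts = split_float(-6.5)
print(parts)               # (1, 0.8125, 3) because 6.5 == 0.8125 * 2**3
print(join_float(*parts))  # -6.5
```

Note that this mantissa is scaled differently from the stored IEEE 754 fraction (which has a hidden leading 1), so the raw bits still need some arithmetic to get out.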

The only other function which might be necessary is the "sign" function
which returns either 1, 0 or -1 depending on the sign...

The sign bit can be anything since it's just a data bit... it needs to be
extracted and appended as well to the extracted values...

If I recall correctly two's complement integers have the highest bit set if
they are negative (for signed integers), or if the value is really large (for
unsigned integers). Which is more or less the same thing...

So if the sign bit is set this means nothing... it could be a small integer
and would be misleading...

I think special treatment might be necessary to prepare the textures for
this...

One trick which comes to mind is the following:

The texture/texel setter checks the two's complement value to see if its 32nd
bit is set. If it is set then the floating point value is made negative.
Otherwise it's left/made positive.

Now in the shader it should be pretty easy to decode the sign bit possibly
with a small lookup table to prevent an expensive branch....

[-1] must become 1
[0] must become 0
[1] must become 0

So the lookup table looks like:
[sign(-1)+1=0] = 1
[sign(0)+1=1] = 0
[sign(1)+1=2] = 0
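
The lookup table above can be sketched like this in Python (just to check the indexing; in a shader sign would be the built-in function, here it is emulated, and SIGN_BIT_TABLE and sign_bit are made-up names):

```python
def sign(x):
    """Emulates a shader-style sign(): returns -1, 0 or +1."""
    return (x > 0) - (x < 0)

# index 0: negative input, index 1: zero, index 2: positive input
SIGN_BIT_TABLE = (1, 0, 0)

def sign_bit(x):
    """Decode the stored 'sign bit' without an if/else on the value itself."""
    return SIGN_BIT_TABLE[sign(x) + 1]

print([sign_bit(v) for v in (-2.5, 0.0, 3.0)])  # [1, 0, 0]
```

One thing the sketch makes visible: a texel whose value is zero always decodes to sign bit 0, so a pattern with only the top bit set would need separate handling.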

I think this could work... kind of a shame that 3 temporary registers have
to be used for this ?! ;)
But maybe it can somehow be stored in the x/y/z/w components to only use one
vector register ?

So far so good, this method might be workable... might be fast...

Maybe there is a better and faster way to extract all bits from a floating
point value.

I have seen some routines/functions but they kinda look complex and hard to
understand, which makes them kinda doubtful... they also didn't seem to
extract all bits ?

I definitely want to extract all 32 bits from a floating point value ! ;)

Bye,
Skybuck.

Skybuck Flying

Jan 4, 2010, 10:02:48 PM

However I might have overlooked something when it comes to "sign"...

It could also return "NaN"... that would be a nasty bug that could suddenly
blow up in one's face far down the road ;)

I saw somebody else mention this and take care of it somehow with some
complex formula... etc...

Hmm... I wonder if NaN will truly be problematic... it might be since the
textures don't care what kind of floating point values they hold... for them
it's just a bunch of bits ;)

Bye,
Skybuck.

Skybuck Flying

Jan 4, 2010, 10:07:49 PM

No, this link mentions that NaN is encoded in the exponent field (all ones,
together with a non-zero fraction)... and not in the sign:

http://en.wikipedia.org/wiki/NaN

The comment (4) in the CG manual was probably referring to something else:

CG Manual:

"sign" description:

"
DESCRIPTION
Returns positive one, zero, or negative one for each of the components of x
based on the component's sign.
1) Returns -1 component if the respective component of x is negative.
2) Returns 0 component if the respective component of x is zero.
3) Returns 1 component if the respective component of x is positive.

4) Ideally, NaN returns NaN.
"

Bye,
Skybuck.

Skybuck Flying

Jan 4, 2010, 10:45:44 PM

Ok,

I finally found something that might be highly useful:

Thanks to these two threads/postings:

(Mentioning the instructions and providing further clarification on the
functions as "direct access"):

http://developer.nvidia.com/forums/index.php?showtopic=613

(Mentioning the packing and unpacking documentation is in the old FP30
profile docs):

ftp://download.nvidia.com/developer/cg/Cg_Users_Manual.pdf

And indeed here is the lowdown :):

"
Pack and Unpack Functions

The fp30 profile provides a number of functions for packing multiple floating
point values into a single 32-bit result. Corresponding unpacking functions
are also provided. These functions map directly to the packing and unpacking
instructions defined by the NV_fragment_program OpenGL extension.

pack_2half()
float pack_2half(float2 a);
float pack_2half(half2 a);

Converts the components of a into a pair of 16-bit floating point values. The
two converted components are then packed into a single 32-bit result. This
operation can be reversed using the unpack_2half() function.

// C Pseudocode
result = (((half)a.y) << 16) | (half)a.x;

unpack_2half()
half2 unpack_2half(float a);

Unpacks a 32-bit value into two 16-bit floating point values.

// C Pseudocode
result.x = (a >> 0) & 0xFFFF;
result.y = (a >> 16) & 0xFFFF;

pack_2ushort()
float pack_2ushort(float2 a);
float pack_2ushort(half2 a);

Converts the components of a into a pair of 16-bit unsigned integers. The two
converted components are then packed into a single 32-bit return value. This
operation can be reversed using the unpack_2ushort() function.

// C Pseudocode
ushort.x = round(65535.0 * clamp(a.x, 0.0, 1.0));
ushort.y = round(65535.0 * clamp(a.y, 0.0, 1.0));
result = (ushort.y << 16) | ushort.x;

unpack_2ushort()
float2 unpack_2ushort(float a);

Unpacks two 16-bit unsigned integer values from a and scales the results into
individual floating point values between 0.0 and 1.0.

// C Pseudocode
result.x = ((a >> 0) & 0xFFFF) / 65535.0;
result.y = ((a >> 16) & 0xFFFF) / 65535.0;

pack_4byte()
float pack_4byte(float4 a);
float pack_4byte(half4 a);

Converts the four components of a into 8-bit signed integers. The signed
integers are such that a representation with all bits set to 0 corresponds to
the value -(128/127), and a representation with all bits set to 1 corresponds
to +(127/127). The four signed integers are then packed into a single 32-bit
result. This operation may be reversed using the unpack_4byte() function.

// C Pseudocode
ub.x = round(127 * clamp(a.x, -128/127, 127/127) + 128);
ub.y = round(127 * clamp(a.y, -128/127, 127/127) + 128);
ub.z = round(127 * clamp(a.z, -128/127, 127/127) + 128);
ub.w = round(127 * clamp(a.w, -128/127, 127/127) + 128);
result = (ub.w << 24) | (ub.z << 16) | (ub.y << 8) | ub.x;

unpack_4byte()
half4 unpack_4byte(float a);

Unpacks four 8-bit integers from a and scales the results into individual
16-bit floating point values between -(128/127) and +(127/127).

// C Pseudocode
result.x = (((a >> 0) & 0xFF) - 128) / 127.0;
result.y = (((a >> 8) & 0xFF) - 128) / 127.0;
result.z = (((a >> 16) & 0xFF) - 128) / 127.0;
result.w = (((a >> 24) & 0xFF) - 128) / 127.0;

pack_4ubyte()
float pack_4ubyte(float4 a);
float pack_4ubyte(half4 a);

Converts the four components of a into 8-bit unsigned integers. The unsigned
integers are such that a representation with all bits set to 0 corresponds to
0.0, and a representation with all bits set to 1 corresponds to 1.0. The four
unsigned integers are then packed into a single 32-bit result. This operation
can be reversed using the unpack_4ubyte() function.

// C Pseudocode
ub.x = round(255.0 * clamp(a.x, 0.0, 1.0));
ub.y = round(255.0 * clamp(a.y, 0.0, 1.0));
ub.z = round(255.0 * clamp(a.z, 0.0, 1.0));
ub.w = round(255.0 * clamp(a.w, 0.0, 1.0));
result = (ub.w << 24) | (ub.z << 16) | (ub.y << 8) | ub.x;

unpack_4ubyte()
half4 unpack_4ubyte(float a);

Unpacks the four 8-bit integers in a and scales the results into individual
16-bit floating point values between 0.0 and 1.0.

// C Pseudocode
result.x = ((a >> 0) & 0xFF) / 255.0;
result.y = ((a >> 8) & 0xFF) / 255.0;
result.z = ((a >> 16) & 0xFF) / 255.0;
result.w = ((a >> 24) & 0xFF) / 255.0;
"

(Had to cut and paste that myself because of shit acrobat reader, I think it
did it ok but you might wanna check anyway ;))
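
To check my reading of the pack_4ubyte/unpack_4ubyte pseudocode, here is a small Python emulation. It is only a sketch: the 32-bit result is modelled as a plain integer instead of a float register, and Python's round() may treat exact .5 ties differently from the hardware:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def pack_4ubyte(a):
    """Pack four values in [0.0, 1.0] into one 32-bit integer, 8 bits each."""
    ub = [int(round(255.0 * clamp(c, 0.0, 1.0))) for c in a]
    return (ub[3] << 24) | (ub[2] << 16) | (ub[1] << 8) | ub[0]

def unpack_4ubyte(bits):
    """Split a 32-bit integer into four bytes and scale each to [0.0, 1.0]."""
    return tuple(((bits >> shift) & 0xFF) / 255.0 for shift in (0, 8, 16, 24))

packed = pack_4ubyte((0.0, 1.0, 0.2, 0.4))
print(hex(packed))                                       # 0x6633ff00
print([round(c * 255) for c in unpack_4ubyte(packed)])   # [0, 255, 51, 102]
```

The x component ends up in the lowest byte and the w component in the highest byte.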

I haven't tried these yet... I would hope and assume that these are
available on profile vp40 and especially fp40 as well ?!

If it does work then this is probably gonna save me a shit load of work and
complexities =DDDDDDDDDDDDDDDDDD =D

Fingers crossed ;) :)

Bye,
Skybuck =D

Skybuck Flying

Jan 4, 2010, 11:05:56 PM

Ok,

This works perfectly ! Such a time saver ! WIEEE =D

CG Shader snippet example:

float4 vColor; // stores color in 4x32 bits r32,g32,b32,a32
half4 vByte; // stores color in 4x8 bits r8,g8,b8,a8

// extract 4x32 bits from 4x32 bit texture (r32,g32,b32,a32)
vColor = texRECT( mTexture, float2( 100, 100 ) );

// extract 4x8 bits from r32
vByte = unpack_4ubyte( vColor.x );

// display byte values as red, green, blue, alpha
ParaOut.mColor.x = vByte.x;
ParaOut.mColor.y = vByte.y;
ParaOut.mColor.z = vByte.z;
ParaOut.mColor.w = vByte.w;

Now on the Delphi side for filling the texture.. simply stuff the 4 bytes
into a 32 bit floating point value and that's all there is to it ! ;) =D
Jipppeee ! =D
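
What "stuffing 4 bytes into a float" means on the host side can be sketched in Python like this (not the actual Delphi code; bytes_to_single and single_to_bytes are hypothetical names, and little-endian byte order is assumed, with r in the lowest byte):

```python
import struct

def bytes_to_single(r, g, b, a):
    """View four bytes (r in the lowest byte) as one little-endian IEEE 754 single."""
    return struct.unpack('<f', bytes([r, g, b, a]))[0]

def single_to_bytes(value):
    """Return the four bytes that store value as a little-endian single."""
    return tuple(struct.pack('<f', value))

texel = bytes_to_single(10, 20, 30, 40)
print(texel)                   # some odd-looking tiny float
print(single_to_bytes(texel))  # (10, 20, 30, 40)
```

The float value itself is meaningless, only its bits matter. One caveat I have not tested on the GPU: byte combinations that happen to form a NaN or infinity pattern (exponent bits all ones) might not survive every path unchanged.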

Also on a side note... the 12 bit fixed point type works as well in the cg
shader:

fixed vTest;
vTest = 1.0;
etc...

Might come in handy for extra speed in the shader... maybe textures could
even be switched to 4x12 bit... but for now I don't see any real benefit to
that...

To me it seems the shaders can work in all modes at the same time... half,
float and fixed.

The textures and texture lookups are just used to get the data into the
shader "memory/variables/registers".
Large 32 bit textures might actually provide some performance benefits since
they require fewer lookups ?!? So maybe the data can be transferred faster
into the gpu ?! that would be cool...

However at the same time... it might be transferring too much data... For
"redcode/corewar" instructions I only need 6 bytes... while a "vector" is
now 4x4 bytes = 16 bytes... which means 10 bytes are being transferred for
nothing and might actually be wasted...

So in this case 32 bit textures might actually be bad for bandwidth
performance...

This means the next step is to look into a better suited texture
format/target/layout which fits 6 bytes.

The first thing which comes to mind is "16 bit color components" in rgb
mode...

However maybe it doesn't matter at the hardware layer...

Maybe the hardware layer likes transferring big blocks of say 8, 16, 32 or
even 64 bytes ?!? Like a cpu cache line ?!?

Hmmm this is something I don't know about on the gpu... does it matter or
does it not matter ? Hmmm....

This is a good question...

Bye,
Skybuck.

MitchAlsup

Jan 5, 2010, 1:32:31 PM

On Jan 4, 6:59 pm, "Skybuck Flying" <IntoTheFut...@hotmail.com> wrote:
> I am not completely happy with D2007 not supporting these types of wacky
> typecasts... if I want to typecast then what's the fricking problem ?! I
> feel slightly frustrated about that... but at the same time... typecasting
> it in this case would have made little sense but still...

There are floating point implementations that do not keep the value
stored in FP registers in "memory format". That is, when a value is
loaded from memory, the bit pattern in the register is not the same as
the bit pattern in memory. However, when stored back into memory the
original bit pattern can be reconstructed and stored.

So, if you were to load a 64-bit integer bit pattern from memory into
say 64-bit Double Precision, and then store the result back in a 32-
bit format, you would not get the expected result if {the bit pattern
overflowed, needed to be rounded, or otherwise manipulated to be
forced into a 32-bit form}. Thus, these kinds of typecasts are at best
machine specific--not architecture specific, and thereby, unportable.
Over on the integer side, this kind of typecasting would carry no
"surprise" to the result.

Mitch

Skybuck Flying

Jan 5, 2010, 9:09:29 PM

While being able to extract bytes is nice and makes the situation a little
bit better...

It's still not quite optimal...

What if I want to extract only 3 bits, and 5 bits ? from a byte...

Hmm.. more work seems necessary ?! ;)

(I am also a little bit worried about gpu random access memory
performance... so maybe I do a test of that later on... to compare against
cpu...)

Bye,
Skybuck.

Skybuck Flying

Jan 5, 2010, 11:10:35 PM

I found a pretty good powerpoint presentation which explains most of it in
high gear ;) (nice and neat, super fast/high speed learning ! :) =D)

It does not mention the special bit patterns like NaN, infinity and what
not... ;) and also doesn't explain some little thingies, here is my
"analysis" of it from a newb-point-of-view ;) :):

It explains the decimal system and its fractions, it then moves on to the
binary equivalent of it, it then provides 2 conversion methods to convert
from decimal floating point to binary floating point, it then explains
normalization, it then explains the IEEE 754 floating point format, it then
explains the bias of 127 to turn a byte into a shortint, it explains the
radix point and the hidden 1, and of course it explains the mantissa and the
exponent as well.. (The exponent could have been named explicitly and
explained explicitly, which it fails to do... a minor little shortcoming but
anybody with an iq higher than 1 will understand the mantissa and understand
what the exponent is... it's the other bunch of bits, the 8 bits ;) :))
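
The pieces mentioned above (sign, 8 exponent bits with a bias of 127, 23 mantissa bits with a hidden 1) can be tried out with a short Python sketch. decode_single and value_from_fields are made-up names, and the rebuild formula only covers normalized numbers (no zero, denormals, infinity or NaN):

```python
import struct

def decode_single(x):
    """Return (sign, biased exponent, fraction) of x stored as an IEEE 754 single."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction):
    """Rebuild a normalized value: (-1)^sign * 1.fraction * 2^(exponent - 127)."""
    significand = 1.0 + fraction / 2.0 ** 23
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)

fields = decode_single(-6.5)
print(fields)                      # (1, 129, 5242880): -1.625 * 2^(129 - 127)
print(value_from_fields(*fields))  # -6.5
```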

It does not explain what +ve or -ve is... for now I will assume this simply
means positive or negative (?)

It does give one example for negative, and one example for positive... for
both conversions (d-to-b, b-to-d). Maybe it should have included examples for
zero, and special bit patterns... that might have made it complete ;) :)

Not bad for a freebie ! ;)

I give it a newbie-friendly-rating of 9.0 out of 10.0 ;):

http://academic.regis.edu/dockel/CS208/LectureSlides/Floating_nov_03.ppt

I also found another one, just to compare against, this one is real messy
but probably contains the same information more or less ;)

http://www.cs.iupui.edu/~n305/fall09/slides/t08BInformationRepresentationFloatingPoint.ppt

I give it a newbie-unfriendly-rating of 3.0 out of 10.0 ;) (Unusable for
newbs ;) :))
But a 5.0 out of 10.0 because of the other one... it's always good to have
multiple docs to compare ;)

Also do not use "Save As" from the IE8 menu... it doesn't save the file
correctly ?! (Stupid IE8.0 ?!)

Simply copy&paste the link and choose Save As... failing that... try
clicking the powerpoint presentation... and it might re-ask to open or
download it... definitely worth a looksie and a store on harddisk ! ;) :)

(I think these presentations will come in highly handy for creating some
kind of pseudo code and finally real code to do "bit perfect" extraction
from floating point (binary) values for systems that lack direct bit
extraction and bit manipulation ;) by using for example the multiplication
or subtraction method and "extracting" the integer part to get the bit...
assuming "extract integer part" is available on the system. Maybe an even
faster algorithm can be designed for systems that already have "extract
fractional part" and such ;), then that work doesn't have to be coded
anymore... just "partial subtraction" to get some of the bits out of it ;))

Bye,
Skybuck =D

Skybuck Flying

Jan 5, 2010, 11:42:24 PM

Well this is all highly interesting... especially for processors missing
certain features/instructions...

But for the GPU/CG Shaders/Shader Model 3.0... I think it's probably easiest
and fastest to simply extract the bytes/words/integers from the floating
points, and stuff them into an int, and then use the special shifts, ORs
and ANDs... to extract the necessary bits... I would hope that this would
be the fastest way to do it... but I could be wrong... maybe the subtraction
method might be fast too... maybe even faster... especially if the number of
bits to extract is constant/known up front... then a loop is not needed...
and maybe something special can be used, like so in a simple example with a
4 bit value:

Remainder = Value;
Bit3 = Remainder div 8; Remainder = Remainder mod 8;
Bit2 = Remainder div 4; Remainder = Remainder mod 4;
Bit1 = Remainder div 2; Remainder = Remainder mod 2;
Bit0 = Remainder;

Then to calculate a new value it could be done as follows:

NewValue = Bit0 * 1 + Bit1 * 2 + Bit2 * 4 + Bit3 * 8;

Maybe this could even work with floating points ;) :)
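
Here is the div/mod idea as a runnable Python sketch, generalized to any bit count (extract_bits and combine_bits are names I made up; it assumes 0 <= value < 2**count):

```python
def extract_bits(value, count):
    """Return the bits of value, most significant first, using only div and mod."""
    bits = []
    remainder = value
    for position in range(count - 1, -1, -1):
        weight = 2 ** position
        bits.append(remainder // weight)  # the "div" step gives the bit
        remainder = remainder % weight    # the "mod" step keeps the rest
    return bits

def combine_bits(bits):
    """Rebuild the value by multiplying each bit with its position weight."""
    count = len(bits)
    return sum(bit * 2 ** (count - 1 - i) for i, bit in enumerate(bits))

print(extract_bits(13, 4))               # [1, 1, 0, 1]
print(combine_bits(extract_bits(13, 4))) # 13

# Splitting a byte into its top 5 bits and bottom 3 bits works the same way:
high5, low3 = 181 // 8, 181 % 8
print(high5, low3)                       # 22 5
```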

This will also work for fractions, something like:

Remainder = Fraction;
Bit1 = trunc( Remainder / (1/2) ); Remainder = Remainder - (1/2) * Bit1;
Bit2 = trunc( Remainder / (1/4) ); Remainder = Remainder - (1/4) * Bit2;
Bit3 = trunc( Remainder / (1/8) ); Remainder = Remainder - (1/8) * Bit3;
Bit4 = trunc( Remainder / (1/16) ); Remainder = Remainder - (1/16) * Bit4;
Etc ;) :)

This is just an idea, untested... I don't even know if it's correct... this is
my first try ever ! =D

Maybe there are faster formulas too ;) :)

Anyway finally to get fraction again.. or integer part simply multiply the
bits with the position value...
So for fractions:
Bit1 * (1/2=0.5) +
Bit2 * (1/4=0.25) +
Bit3 * (1/8=0.125) etc;
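
The fraction version can be sketched the same way in Python, taking the integer part of each division as the bit (extract_fraction_bits and combine_fraction_bits are made-up names; it assumes 0 <= fraction < 1):

```python
import math

def extract_fraction_bits(fraction, count):
    """Return the first count binary digits after the radix point."""
    bits = []
    remainder = fraction
    for position in range(1, count + 1):
        weight = 1.0 / 2 ** position                # 1/2, 1/4, 1/8, ...
        bit = int(math.floor(remainder / weight))   # integer part is the bit
        bits.append(bit)
        remainder -= weight * bit                   # subtract what was extracted
    return bits

def combine_fraction_bits(bits):
    """Rebuild the fraction by multiplying each bit with its position value."""
    return sum(bit / 2 ** (i + 1) for i, bit in enumerate(bits))

print(extract_fraction_bits(0.625, 4))                         # [1, 0, 1, 0]
print(combine_fraction_bits(extract_fraction_bits(0.625, 4)))  # 0.625
```

Since every weight is a power of two, these steps are exact in binary floating point as long as the fraction itself fits in the available bits.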

Bye,
Skybuck.

Skybuck Flying

Jan 6, 2010, 1:09:04 AM

Also here is a nice floating point converter java applet, might come in
handy for some quick tests ;)

http://www.h-schmidt.net/FloatApplet/IEEE754.html

Bye,
Skybuck.

Skybuck Flying

Jan 8, 2010, 12:47:53 PM

Hello,

Today I remembered a trick from solvers... it's called binary variables.

It can be used to make decisions without actually requiring a branch.

The idea is as follows:

BinaryVariable * Potential Answer

This way many potential answers can be calculated.

The binary variable could be determined with a little trick out of the
processor world:

Compares are done with subtraction.

A-B

If they are equal then the result is zero.
If A is greater than B then the result is positive.
If A is smaller than B then the result is negative.

So taking the sign of the result would give 0, +1, -1

How to use this further is left to your imagination ! ;) :)
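
One possible use, as a Python sketch: turn the sign of A-B into a 0/1 binary variable and use it to pick the larger value without an if. The names is_greater and select_max are hypothetical, and in a shader sign would be the built-in function rather than this emulation:

```python
def sign(x):
    """Emulates a shader-style sign(): returns -1, 0 or +1."""
    return (x > 0) - (x < 0)

def is_greater(a, b):
    """Binary variable: 1 if a > b, else 0, derived from the sign of a - b."""
    s = sign(a - b)          # -1, 0 or +1
    return (s * s + s) // 2  # maps -1 -> 0, 0 -> 0, +1 -> 1

def select_max(a, b):
    """BinaryVariable * PotentialAnswer for both candidates, then add them up."""
    g = is_greater(a, b)
    return g * a + (1 - g) * b

print(select_max(3, 7), select_max(7, 3), select_max(5, 5))  # 7 7 5
```

Both potential answers are always calculated, the binary variable just zeroes out the one that is not wanted.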

Bye,
Skybuck =D