Floating Point Saturation

thunder

Nov 3, 2009, 10:54:16 AM

Hi all

I have a compute engine which generates an 8-bit floating number,
fp_num[7:0]. The format of this 8-bit floating point number is 3-bit
signed exponent and 5-bit mantissa (1-bit mantissa sign and 4-bit
mantissa magnitude) ie

  7    6    5         4            3         2         1         0
 exp  exp  exp  mantissa_sign  mantissa  mantissa  mantissa  mantissa

I need to make sure that the final value of fp_num i send out to the
next block does not exceed the range [-num_clamp, +num_clamp], where
num_clamp is also an 8-bit FP number with the same data bit
representation as above, except the mantissa sign of num_clamp is
always zero (ie the mantissa is always positive).

One way i thought of doing this is to translate both fp_num and
num_clamp to a fixed point number and then determine if fxd_num is
less than or greater than fxd_num_clamp.

I wondered if anyone has any other suggestions to determine this
floating point saturation, which might be simpler and/or better.

Thanks in advance

J

glen herrmannsfeldt

Nov 3, 2009, 1:33:35 PM

In comp.dsp thunder <ja...@hotmail.com> wrote:

> I have a compute engine which generates an 8-bit floating number,
> fp_num[7:0]. The format of this 8-bit floating point number is 3-bit
> signed exponent and 5-bit mantissa (1-bit mantissa sign and 4-bit
> mantissa magnitude) ie

>   7    6    5         4            3         2         1         0
>  exp  exp  exp  mantissa_sign  mantissa  mantissa  mantissa  mantissa

> I need to make sure that the final value of fp_num i send out to the
> next block does not exceed the range [-num_clamp, +num_clamp], where
> num_clamp is also an 8-bit FP number with the same data bit
> representation as above, except the mantissa sign of num_clamp is
> always zero (ie the mantissa is always positive).

> One way i thought of doing this is to translate both fp_num and
> num_clamp to a fixed point number and then determine if fxd_num is
> less than or greater than fxd_num_clamp.

Without making comments on the usefulness of such a small floating
point value...

It is usual to use a biased exponent instead of a signed exponent.
Since I don't know what you mean by signed exponent, I will explain
biased exponent.

In three bits, 000 would be the smallest (most negative) exponent
and 111 the largest (most positive). You get from twos complement
to biased by inverting the sign bit.

With a biased exponent on the left, normalized positive floating
point numbers can be compared directly as unsigned values.
In your case, all you need to do is set the sign bit to positive
(usually 0), and compare.

If the exponent is twos complement, invert the sign
bit before comparing.
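
As a rough illustration of the above (a sketch, not code posted in the thread): assuming the layout from the original post, with a two's complement exponent in bits 7:5, the mantissa sign in bit 4 and a normalised mantissa magnitude in bits 3:0, the saturation check could look like this. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Magnitude key: invert the exponent sign bit (bit 7) so the exponent
   becomes biased, and clear the mantissa sign (bit 4).
   ASSUMES a normalised mantissa, so ordering by (exponent, mantissa)
   matches ordering by magnitude. */
static inline uint8_t fp8_mag_key(uint8_t fp)
{
    return (uint8_t)((fp ^ 0x80u) & 0xEFu);
}

/* True when |fp_num| is larger than num_clamp (one unsigned compare). */
static inline bool fp8_exceeds(uint8_t fp_num, uint8_t num_clamp)
{
    return fp8_mag_key(fp_num) > fp8_mag_key(num_clamp);
}

/* Clamp fp_num to the clamp magnitude, keeping the sign of fp_num. */
static inline uint8_t fp8_saturate(uint8_t fp_num, uint8_t num_clamp)
{
    if (!fp8_exceeds(fp_num, num_clamp))
        return fp_num;
    return (uint8_t)((num_clamp & 0xEFu) | (fp_num & 0x10u));
}
```

For example, with num_clamp = 0x20 (exponent +1), the input 0x7F (exponent +3, negative) would saturate to 0x30, which is the clamp magnitude with the negative sign kept.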

-- glen

Vladimir Vassilevsky

Nov 4, 2009, 3:25:32 PM

Compare exponents
Compare mantissas
What's the problem?

VLV

glen herrmannsfeldt

Nov 4, 2009, 3:45:33 PM

In comp.dsp Vladimir Vassilevsky <nos...@nowhere.com> wrote:

> Compare exponents
> Compare mantissas
> What's the problem?

Twice as much work as you need to do.

Note that the PDP-10 has one compare instruction for both
fixed and floating point numbers.

-- glen

Jerry Avins

Nov 4, 2009, 4:01:19 PM

You can generate a mere 256 numbers. How many are out of range? At
worst, you could search a too-big list and a too-small list to see if
your number is on it.

How would you compare ordinary signed integers? Mask off the mantissas
with AND 0xE0 and compare exponents. If .GT. or .LT., you have the answer.
If .EQ., shift both left 3 and compare again.

Think about what happens if you XOR with 0x90 to make mantissa and
exponent offset binary instead of two's complement before doing an
unsigned integer compare.
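
The point about there being only 256 codes can be checked by brute force. The following is a minimal sketch, assuming the layout from the original post with a 3-bit two's complement exponent and a normalised, purely fractional magnitude of the form 0.1mmmm (binary). The function names are hypothetical and the scaling by 512 is just a convenient choice to keep everything in integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Magnitude scaled by 512, ASSUMING value = 0.1mmmm (binary) * 2^exp
   with a 3-bit two's complement exponent in bits 7:5. */
static inline int fp8_mag512(uint8_t fp)
{
    int e = (fp >> 5) & 0x7;
    if (e & 0x4)
        e -= 8;                             /* sign-extend: -4..+3 */
    return (0x10 | (fp & 0x0F)) << (e + 4); /* shift amount is 0..7 */
}

/* True when |fp_num| is larger than the (positive) clamp value. */
static inline bool fp8_out_of_range(uint8_t fp_num, uint8_t num_clamp)
{
    return fp8_mag512(fp_num) > fp8_mag512(num_clamp);
}

/* Walk all 256 codes and count those that would need saturating. */
static inline int fp8_count_out_of_range(uint8_t num_clamp)
{
    int count = 0;
    for (int code = 0; code < 256; ++code)
        if (fp8_out_of_range((uint8_t)code, num_clamp))
            ++count;
    return count;
}
```

The same loop could fill a 256-entry lookup table instead of a counter if num_clamp changes rarely.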

Jerry
--
Engineering is the art of making what you want from things you can get.

Vladimir Vassilevsky

Nov 4, 2009, 4:10:19 PM

glen herrmannsfeldt wrote:

> In comp.dsp Vladimir Vassilevsky <nos...@nowhere.com> wrote:
>
>
>>Compare exponents
>>Compare mantissas
>>What's the problem?
>
>
> Twice as much work as you need to do.

Actually less work. Both compare operations are narrow and could be
done in parallel. The question is likely in the context of FPGA.
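
A software sketch of the two narrow compares, written so that each intermediate result maps onto an independent comparator. It assumes the layout from the original post (two's complement exponent in bits 7:5, mantissa magnitude in bits 3:0) and compares magnitudes only; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend the 3-bit exponent field (bits 7:5) to the range -4..+3. */
static inline int fp8_exp_field(uint8_t fp)
{
    int e = (fp >> 5) & 0x7;
    return (e & 0x4) ? e - 8 : e;
}

/* True when |a| > |b|. The three partial results do not depend on each
   other, so in hardware they could be evaluated in parallel. */
static inline bool fp8_mag_gt_split(uint8_t a, uint8_t b)
{
    bool exp_gt = fp8_exp_field(a) > fp8_exp_field(b); /* 3-bit signed   */
    bool exp_eq = fp8_exp_field(a) == fp8_exp_field(b);
    bool man_gt = (a & 0x0F) > (b & 0x0F);             /* 4-bit unsigned */
    return exp_gt || (exp_eq && man_gt);
}
```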

> Note that the PDP-10 has one compare instruction for both
> fixed and floating point numbers.

This is possible if the floats are in an IEEE 754-like representation and
handling of NaNs and denormals is not required.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

thunder

Nov 6, 2009, 4:27:22 AM

Thanks for the input.

Just to clarify a few things ...

a) By signed exponent, i mean the exponent is a 3-bit signed 2's
complement number. This means that the exponent has a range of [-4,
+3].

b) Exponent bias - An exponent bias has already been added. The
exponent bias is also a signed 3-bit 2's complement number. Thus the
addition of the exponent bias generated a signed 4-bit 2's complement
number. However, after the addition of the exponent bias, the
resultant exponent is then saturated to be again a 3-bit signed 2's
complement number with a range of [-4, +3]. Therefore, the exponent
part of fp_num is a 3-bit signed 2's complement number having a range
of [-4, +3].

c) Also the mantissa of fp_num is a normalised mantissa. The mantissa
represents in this case a purely fractional part. Thus the
normalisation of the mantissa means the MSB of the mantissa is set and
implied. Thus for fp_num, we have fp_num[4] = mantissa sign and fp_num
[3:0] being the mantissa magnitude. However, the mantissa magnitude is
normalised. Thus actual_mantissa_magnitude = '1' & fp_num[3:0] (where
& is concatenation in this instance).
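
For reference, the format in a) to c) can be decoded into the fixed-point form mentioned in the first post. This is only a sketch: it assumes the value is sign * 0.1mmmm (binary) * 2^exponent, and it scales the result by 512 so that every code is an exact integer. The function names and the scale factor are hypothetical choices.

```c
#include <stdint.h>

/* 3-bit two's complement exponent from bits 7:5, range -4..+3. */
static inline int fp8_exponent(uint8_t fp)
{
    int e = (fp >> 5) & 0x7;
    return (e & 0x4) ? e - 8 : e;
}

/* Signed fixed-point value in units of 1/512.
   ASSUMES magnitude = ('1' & fp[3:0]) / 32 * 2^exponent. */
static inline int16_t fp8_to_fixed(uint8_t fp)
{
    int mag = (0x10 | (fp & 0x0F)) << (fp8_exponent(fp) + 4);
    return (int16_t)((fp & 0x10) ? -mag : mag);
}
```

Under these assumptions the magnitudes span 16 to 3968, so 12 bits plus a sign are enough, and the saturation test becomes a plain integer comparison of the two converted values.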

Thanks

J

thunder

Nov 6, 2009, 4:32:09 AM

The compares will be done in parallel.

Furthermore, the floating-point numbers are an internal representation
and do not conform to IEEE 754. Thus handling of NaNs and denormals is
not required.

Thanks

J

glen herrmannsfeldt

Nov 6, 2009, 5:13:30 AM

In comp.dsp thunder <ja...@hotmail.com> wrote:

(snip on comparing floating point values)

> Thanks for the input.

> Just to clarify a few things ...

> a) By signed exponent, i mean the exponent is a 3-bit signed 2's
> complement number. This means that the exponent has a range of [-4,
> +3].

All the formats that I know of use a biased representation instead of twos
complement. The actual difference is only in the sign bit.
With a biased exponent instead of twos complement, you use
an unsigned compare instead of a signed compare. It just makes
things a little easier.

> b) Exponent bias - An exponent bias has already been added. The
> exponent bias is also a signed 3-bit 2's complement number. Thus the
> addition of the exponent bias generated a signed 4-bit 2's complement
> number. However, after the addition of the exponent bias, the
> resultant exponent is then saturated to be again a 3-bit signed 2's
> complement number with a range of [-4, +3]. Therefore, the exponent
> part of fp_num is a 3-bit signed 2's complement number having a range
> of [-4, +3].

With a biased exponent the range is still -4 to +3, but the bits
used to represent the value are different. There should be no
discussion of twos complement here at all.

> c) Also the mantissa of fp_num is a normalised mantissa. The mantissa
> represents in this case a purely fractional part. Thus the
> normalisation of the mantissa means the MSB of the mantissa is set and
> implied. Thus for fp_num, we have fp_num[4] = mantissa sign and fp_num
> [3:0] being the mantissa magnitude. However, the mantissa magnitude is
> normalised. Thus actual_mantissa_magnitude = '1' & fp_num[3:0] (where
> & is concatenation in this instance).

I prefer 'significand' to 'mantissa', but otherwise I think that
sounds fine.

-- glen

Terje Mathisen

Nov 6, 2009, 5:18:55 AM

thunder wrote:
> b) Exponent bias - An exponent bias has already been added. The
> exponent bias is also a signed 3-bit 2's complement number. Thus the
> addition of the exponent bias generated a signed 4-bit 2's complement
> number. However, after the addition of the exponent bias, the
> resultant exponent is then saturated to be again a 3-bit signed 2's
> complement number with a range of [-4, +3]. Therefore, the exponent
> part of fp_num is a 3-bit signed 2's complement number having a range
> of [-4, +3].
>
> c) Also the mantissa of fp_num is a normalised mantissa. The mantissa
> represents in this case a purely fractional part. Thus the
> normalisation of the mantissa means the MSB of the mantissa is set and
> implied. Thus for fp_num, we have fp_num[4] = mantissa sign and fp_num
> [3:0] being the mantissa magnitude. However, the mantissa magnitude is
> normalised. Thus actual_mantissa_magnitude = '1'& fp_num[3:0] (where
> & is concatenation in this instance).

This means that all positive numbers can be compared with an unsigned
(or signed, since the sign bit is zero) comparison, right?

For negative numbers it is exactly the same, except the result must be
inverted.

When the values have opposite signs, the result is the inverse of the
sign bit:

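#include <stdint.h>

/* Compare two sign-magnitude bytes whose sign is in bit 7.
   Returns <0, 0 or >0 when a is smaller than, equal to or larger than b. */
int cmp8(int8_t a, int8_t b)
{
    if ((a ^ b) & 0x80) {          /* opposite signs */
        if (a < 0) return -1;      // b is larger
        return 1;                  // a is larger (positive)
    }
    /* Same sign: compare magnitudes, reversed when both are negative. */
    if (a < 0) return ((b & 0x7f) - (a & 0x7f));
    return ((a & 0x7f) - (b & 0x7f));
}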
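
The same three-case logic can be sketched for the layout described earlier in the thread, where the sign sits in bit 4 and the exponent in bits 7:5 is two's complement. This variant is an assumption-laden adaptation rather than the code above: it inverts the exponent sign bit and clears the mantissa sign to get an unsigned magnitude key, and the name fp8_cmp is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns <0, 0 or >0 when a is smaller than, equal to or larger than b.
   ASSUMES: exponent = bits 7:5 (two's complement), sign = bit 4,
   normalised mantissa magnitude = bits 3:0. */
static inline int fp8_cmp(uint8_t a, uint8_t b)
{
    int ka = (a ^ 0x80) & 0xEF;       /* magnitude key of a */
    int kb = (b ^ 0x80) & 0xEF;       /* magnitude key of b */
    bool a_neg = (a & 0x10) != 0;
    bool b_neg = (b & 0x10) != 0;

    if (a_neg != b_neg)               /* opposite signs */
        return a_neg ? -1 : 1;
    /* Same sign: compare keys, reversed when both are negative. */
    return a_neg ? kb - ka : ka - kb;
}
```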

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"