safely casting float to int

luca.b...@gmail.com

Oct 18, 2018, 6:01:15 AM

Hi, I'd like to know if there is a safe way to cast a float to int

float value; int n;
...
n = int(value);

But proper code should first do a bound check.
A first attempt is:

#include <limits.h>

int truncate_float_to_int(float value) {
    if (value > INT_MAX)
        return INT_MAX;
    if (value < INT_MIN)
        return INT_MIN;
    return int(value);
}

float value; int n;
...
n = truncate_float_to_int(value);

But that is not really correct: in the expression 'value > INT_MAX' the INT_MAX value is converted to a float, so it is rounded to one of the two nearest floating point numbers, so we may get a greater value.
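A minimal sketch of that failure, assuming a 32-bit int, an IEEE 754 single-precision float, and float comparisons carried out without extra precision. The helper names are made up for illustration:

```c
#include <limits.h>

/* Hypothetical helper: the float that INT_MAX turns into when it appears
   in an expression such as 'value > INT_MAX'. */
float int_max_as_float(void) {
    return (float)INT_MAX;
}

/* Hypothetical helper: returns 1 if the naive check 'value > INT_MAX'
   lets through a value that is too large for a 32-bit int. */
int naive_check_misses(float value) {
    int passes_naive_check = !(value > INT_MAX);
    int too_big = value >= 2147483648.0f;   /* 2^31, one more than INT_MAX */
    return passes_naive_check && too_big;
}
```

Under those assumptions the only float that slips through is the one that compares equal to the converted INT_MAX.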

For example, consider using 32 bit int type and 32 bit IEEE 754 float type.
I found this blog post:

https://blog.forrestthewoods.com/perfect-prevention-of-int-overflows-f000f7e893ee

Author says:
"Now is when things start to get fun. The largest int is 2147483647 (2^31–1). If you convert that int to a float it rounds up to 2147483650f."

So he gets a greater value.
How should one properly check the conversion of float to int?

Thank you very much!

luca.b...@gmail.com

Oct 18, 2018, 6:13:38 AM

Ok I just found that

int truncate_float_to_int(float value) {
    if (value >= INT_MAX)
        return INT_MAX;
    if (value <= INT_MIN)
        return INT_MIN;
    return int(value);
}

is correct.
Sorry to have bothered you with my question!
Thank you

Ben Bacarisse

Oct 18, 2018, 6:50:04 AM

luca.b...@gmail.com writes:

> Hi, I'd like to know if there is a safe way to cast a float to int
>
> float value; int n;
> ...
> n = int(value);

That's C++. Here, we write (int)value. I'm not sure if there will be
any difference in C++ but the way C++ provides limits is certainly very
different.

> But proper code should first do a bound check.
> A first attempt is:
>
>
> #include <limits.h>
>
> int truncate_float_to_int(float value) {
> if (value > INT_MAX)
> return INT_MAX;
> if (value < INT_MIN)
> return INT_MIN;
> return int(value);
> }
>
> float value; int n;
> ...
> n = truncate_float_to_int(value);
>
>
> But that is not really correct: in the expression 'value > INT_MAX'
> the INT_MAX value is converted to a float, so it is rounded to one of
> the two nearest floating point numbers, so we may get a greater value.
>
> For example, consider using 32 bit int type and 32 bit IEEE 754 float
> type.

For that case, I think I'd simply compare to (double)INT_MAX and
(double)INT_MIN, but let's consider the more general problem...
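A sketch of that "compare as double" idea, assuming double can hold INT_MAX and INT_MIN exactly (true for a 32-bit int and an IEEE 754 double). The function name is made up for illustration:

```c
#include <limits.h>

/* Clamp by comparing in double, where both int limits are exact
   (assumption: 32-bit int, IEEE 754 double). */
int truncate_float_to_int_via_double(float value) {
    double d = value;              /* float to double is always exact */
    if (d > (double)INT_MAX)
        return INT_MAX;
    if (d < (double)INT_MIN)
        return INT_MIN;
    return (int)value;             /* NaN is not handled in this sketch */
}
```

Because the limits are exact in double, plain > and < are enough here.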

> I found this blog post:
>
> https://blog.forrestthewoods.com/perfect-prevention-of-int-overflows-f000f7e893ee
>
> Author says: "Now is when things start to get fun. The largest int is
> 2147483647 (2^31–1). If you convert that int to a float it rounds up
> to 2147483650f."

(That's an odd result. I'd expect 2147483648 but then C offers very few
guarantees about exactly what floating point implementations should do.)

> So he gets a greater value.
> How should one properly check the conversion of float to int?

A first thought is to use log2f(value) but then you need to worry about
integer padding bits. I.e. if you get

log2f(value) < CHAR_BIT * sizeof(int) - 1

you may still not be able to convert without overflow. You can check
for padding bits (INT_MAX >> (CHAR_BIT * sizeof(int) - 2) == 1 means you
are OK) but it's getting complicated.

What's more, you have to handle negative numbers slightly differently
because C permits representations other than twos complement.

All this makes me want to ask how portable, to bizarre implementations,
you need this to be.

By the way, if you can assume you have FLT_RADIX == 2, ilogbf might be a
better function to use, but it still does not solve the problem by
itself.
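One way to sidestep both the logarithm and the padding-bit test is to build the bound as a float directly from INT_MAX. This is a different technique from the log2f/ilogbf idea above, offered only as a sketch; it assumes FLT_RADIX == 2 and IEEE-style comparisons, and the function name is made up:

```c
#include <limits.h>

/* Returns 1 if (int)value is well defined, 0 otherwise. NaN and the
   infinities give 0 because every comparison below is false for them
   (assumption: IEEE-style comparisons). */
int float_fits_in_int(float value) {
    /* Build 2^N as a float, where N is the number of value bits in int.
       Shifting INT_MAX only walks over value bits, so padding bits do
       not matter, and a power of two is exact when FLT_RADIX is 2. */
    float limit = 1.0f;
    for (int m = INT_MAX; m != 0; m >>= 1)
        limit *= 2.0f;

    if (value < limit && value > -limit)
        return 1;
    /* -2^N itself fits only if INT_MIN is below -INT_MAX (two's complement). */
    return value == -limit && INT_MIN < -INT_MAX;
}
```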

--
Ben.

luca.b...@gmail.com

Oct 18, 2018, 9:04:41 AM

Hi Ben!

> writes:
>
> > Hi, I'd like to know if there is a safe way to cast a float to int
> >
> > float value; int n;
> > ...
> > n = int(value);
>
> That's C++. Here, we write (int)value. I'm not sure if there will be
> any difference in C++ but the way C++ provides limits is certainly very
> different.
>

ouch, yeah I was wrong, I really meant to use C style casts here.
I'm coding in C++ right now, that's the reason for my mistake ;)

> > But proper code should first do a bound check.
> > A first attempt is:
> >
> >
> > #include <limits.h>
> >
> > int truncate_float_to_int(float value) {
> > if (value > INT_MAX)
> > return INT_MAX;
> > if (value < INT_MIN)
> > return INT_MIN;
> > return int(value);
> > }
> >
> > float value; int n;
> > ...
> > n = truncate_float_to_int(value);
> >
> >
> > But that is not really correct: in the expression 'value > INT_MAX'
> > the INT_MAX value is converted to a float, so it is rounded to one of
> > the two nearest floating point numbers, so we may get a greater value.
> >
> > For example, consider using 32 bit int type and 32 bit IEEE 754 float
> > type.
>
> For that case, I think I'd simply compare to (double)INT_MAX and
> (double)INT_MIN, but let's consider the more general problem...
>

I'd say that

int truncate_float_to_int(float value) {
    if (value >= INT_MAX)
        return INT_MAX;

    if (value <= INT_MIN)
        return INT_MIN;

    return (int) value;
}

is indeed correct:

if (float)INT_MAX rounds from 2147483647 to 2147483650f, that means no floating point number can ever be 2147483649.x, nor 2147483648.x, nor 2147483647.x. So the value is either 2147483646.x or below, in which case it correctly truncates to an int, or it is >= 2147483650f, in which case we should just return INT_MAX. It can never be in between.

So the thing here is that you need to compare with >=, <= and not with >, <.
Using <, > makes truncate_float_to_int(INT_MAX) Undefined Behaviour.

> > I found this blog post:
> >
> > https://blog.forrestthewoods.com/perfect-prevention-of-int-overflows-f000f7e893ee
> >
> > Author says: "Now is when things start to get fun. The largest int is
> > 2147483647 (2^31–1). If you convert that int to a float it rounds up
> > to 2147483650f."
>
> (That's an odd result. I'd expect 2147483648 but then C offers very few
> guarantees about exactly what floating point implementations should do.)
>

The somewhat counterintuitive fact about floating point is that at some point you start missing entire units, so the gap between consecutive floating point numbers is no longer less than 1, but becomes greater than 1.

As outlined in the blog post, the next floating point number after 1000000000f is not "about" 1000000001, but "about" 1000000064 (!), so here the difference between a floating point number and its successor is about 64. And 1000000000 is a valid 32 bit signed integer!
That's because IEEE 754 floating point is made up of a mantissa and an exponent. If it were implemented by storing an integral part and a fractional part, then the difference between consecutive values would always be less than 1.
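The size of that gap can be estimated without any bit tricks. A rough sketch, assuming FLT_RADIX == 2, a finite argument of at least 1.0f, and float arithmetic without extra precision (the function name is made up):

```c
#include <float.h>

/* Gap between x and the next larger float, for finite x >= 1.0f.
   Assumes FLT_RADIX == 2. */
float float_spacing(float x) {
    float power = 1.0f;              /* becomes the largest power of two <= x */
    while (power * 2.0f <= x)
        power *= 2.0f;
    return power * FLT_EPSILON;      /* one unit in the last place at x */
}
```

For example, float_spacing(1000000000.0f) is the distance from 1000000000 to the next representable float.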

Thank you very much Ben!

Anton Shepelev

Oct 18, 2018, 9:16:13 AM

Luca:

>Hi, I'd like to know if there is a safe way to cast a float
>to int
>
>float value; int n; ... n = int(value);
>
>But proper code should first do a bound check.
>

>int truncate_float_to_int(float value) {
> if (value >= INT_MAX)
> return INT_MAX;

> if (value <= INT_MIN)
> return INT_MIN;
> return (int) value;
>}

Safe in what sense? That function will silently ignore an
out-of-range value. What about:

/* Convert float to int. Return values:
0: success
-1: lower bound exceeded
1: higher bound exceeded
*/
int ftoi( float f, int * i );
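A possible body for that prototype, as a sketch only. It assumes a 32-bit int and an IEEE 754 float, so that (float)INT_MAX is a value just outside the int range. How NaN is reported is not covered by the three return codes, so the choice below is arbitrary:

```c
#include <limits.h>

/* Convert float to int. Return values:
    0: success
   -1: lower bound exceeded
    1: higher bound exceeded
   In this sketch a NaN is also reported as 1, which is an arbitrary choice.
   *i is only written on success. */
int ftoi(float f, int *i) {
    if (!(f < (float)INT_MAX))     /* too large, or NaN */
        return 1;
    if (f < (float)INT_MIN)        /* INT_MIN itself is exact and allowed */
        return -1;
    *i = (int)f;
    return 0;
}
```

A caller would check the return value before using the converted int.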

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

luca.b...@gmail.com

Oct 18, 2018, 9:44:37 AM

Hi Anton!
Well, yeah that's much better style!
For my use case it's ok to silently clamp to int representable values.
Thanks!

Ben Bacarisse

Oct 18, 2018, 9:48:50 AM

Yes, me too. I did not clock that the only problematic case is when
there is equality so changing > to >= (and < to <=) fixes it.

<snip>

> Thank you very much Ben!

Oh I just complicated things!

--
Ben.

Bart

Oct 18, 2018, 9:54:16 AM

On 18/10/2018 14:04, luca.b...@gmail.com wrote:

>> (That's an odd result. I'd expect 2147483648 but then C offers very few
>> guarantees about exactly what floating point implementations should do.)
>>
>
> The somewhat counter intuitive fact about floating point is that at some point you start missing entire units, so the gap between subsequent floating point numbers is not less than 1, but starts to be higher than 1.
>
> As outlined in the blog post, the next floating point subsequent to 1000000000f is not "about" 1000000001, but "about" 1000000064 (!) so here the difference between a floating point number and its subsequent is about 64. And 1000000000 is a valid 32 bit signed integer!

A 32-bit float type has only 23 bits of precision (of which the first
bit is usually 1).

A 32-bit integer type has 32 bits of precision, or 31 bits looking only
at the magnitude of a signed integer.

So there are 8 bits missing. I would expect values around INTMAX to be
these, separated by 256:

2 147 483 136
2 147 483 392

with the next value being, I think:

2 147 483 648

(Here there might be a change in exponent, so also a shift in mantissa).

But this value is INTMAX+1. I'm not sure where your results came from.

The above seem to be verified with this program (where 'float' is 32-bit
IEEE format):

float x = 2147483648;

printf("%f %08X\n",x, *(int*)&x);

--(*(int*)&x);
printf("%f\n",x);
--(*(int*)&x);
printf("%f\n",x);

The decrements work on the integer representation of a float, and change
it by one LSB at a time.
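Reading a float through an int pointer, as in *(int*)&x, runs into C's aliasing rules. The same stepping can be sketched with memcpy instead, assuming a 32-bit IEEE 754 float and a positive, finite value (the function name is made up):

```c
#include <stdint.h>
#include <string.h>

/* Step a float by 'delta' units of its bit pattern without pointer casts.
   Assumes float is 32-bit IEEE 754 and x is positive and finite, so the
   bit pattern is ordered the same way as the value. */
float step_float_bits(float x, int32_t delta) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the representation out */
    bits += (uint32_t)delta;          /* unsigned wraparound handles negative delta */
    memcpy(&x, &bits, sizeof x);      /* and copy it back */
    return x;
}
```

Calling step_float_bits(x, -1) repeatedly walks down through neighbouring floats one at a time.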

--
bart

luca.b...@gmail.com

Oct 18, 2018, 9:54:48 AM

Not at all, really! :)

luca.b...@gmail.com

Oct 18, 2018, 9:58:42 AM

Good point, I'll check it out. Took what's in the blog post for granted.

luca.b...@gmail.com

Oct 18, 2018, 10:11:37 AM

Yep, you're right ¯\_(ツ)_/¯

james...@alumni.caltech.edu

Oct 18, 2018, 10:28:41 AM

On Thursday, October 18, 2018 at 9:04:41 AM UTC-4, luca.b...@gmail.com wrote:
> Hi Ben!
...

> > > Author says: "Now is when things start to get fun. The largest int is
> > > 2147483647 (2^31–1). If you convert that int to a float it rounds up
> > > to 2147483650f."
> >
> > (That's an odd result. I'd expect 2147483648 but then C offers very few
> > guarantees about exactly what floating point implementations should do.)
> >
>
> The somewhat counter intuitive fact about floating point is that at
> some point you start missing entire units, so the gap between
> subsequent floating point numbers is not less than 1, but starts to be
> higher than 1.
>
> As outlined in the blog post, the next floating point subsequent to
> 1000000000f is not "about" 1000000001, but "about" 1000000064 (!) so
> here the difference between a floating point number and its subsequent
> is about 64. And 1000000000 is a valid 32 bit signed integer!
> That's because IEEE 754 floating point is made by mantissa and
> exponent. If they were implemented by storing integral part and
> fractional part then you'd always have that the difference between
> subsequents is less than 1.

The issue you raise is a valid one, but is unlikely to apply in this
case. What the C standard says about such conversions is "When a value
of integer type is converted to a real floating type, if the value being
converted can be represented exactly in the new type, it is unchanged. If
the value being converted is in the range of values that can be
represented but cannot be represented exactly, the result is either the
nearest higher or nearest lower representable value, chosen in an
implementation-defined manner. ..." (6.3.1.4p2).
In the case you're talking about, 1000000000f == 5^9 * 2^9 ==
1953125*2^9. If FLT_RADIX == 2 (by far, the most common case), then the
mantissa needs no more than 21 bits to represent 1953125 exactly. IEEE
754 single precision has 23 mantissa bits, with an assumed leading bit
of 1, so that number can be represented exactly. However, 1000000001F
isn't divisible by 2, so it requires a full 30 bits of mantissa to be
represented exactly. The next representable value is 1000000064, as you
say.

The number in this case, though, is 2147483647, which is not exactly
representable as an IEEE 754 single precision number. However,
2147483648 is exactly 2^31, and should therefore be exactly
representable if FLT_RADIX is a power of 2, in any floating point format
that conforms to the requirements of the C standard. Therefore, on such
a system, the only values that are permitted by 6.3.1.4p2 for
(float)2147483647 are 2147483648.0 or the representable value just below
that one. An implementation that produces 2147483650 must either have
a FLT_RADIX that is not a power of 2 (pretty uncommon, nowadays) or be
non-conforming.
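Which integers survive the trip through float can be checked directly. A small sketch, assuming IEEE 754 single precision and a long long that is wider than int (the function name is made up):

```c
/* Returns 1 if n converts to float and back without changing value.
   The return trip goes through long long so that a result such as 2^31
   does not overflow (assumption: long long is wider than int). */
int survives_float_round_trip(int n) {
    float f = (float)n;
    return (long long)f == (long long)n;
}
```

Under those assumptions 1000000000 survives, while 1000000001 and 2147483647 do not.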

David Brown

Oct 18, 2018, 10:33:28 AM

On 18/10/18 15:44, luca.b...@gmail.com wrote:
> On Thursday, 18 October 2018 at 15:16:13 UTC+2, Anton Shepelev wrote:
>> Luca:
>>
>>> Hi, I'd like to know if there is a safe way to cast a float
>>> to int
>>>
>>> float value; int n; ... n = int(value);
>>>
>>> But proper code should first do a bound check.
>>>
>>> int truncate_float_to_int(float value) {
>>> if (value >= INT_MAX)
>>> return INT_MAX;
>>> if (value <= INT_MIN)
>>> return INT_MIN;
>>> return (int) value;
>>> }
>>
>> Safe in what sense? That function will silently ignore an
>> out-of-range value. What about:
>>
>> /* Convert float to int. Return values:
>> 0: success
>> -1: lower bound exceeded
>> 1: higher bound exceeded
>> */
>> int ftoi( float f, int * i );
>>

> Hi Anton!
> Well, yeah that's much better style!
> For my use case it's ok to silently clamp to int representable values.
> Thanks!
>

It is better style if it suits your needs - worse style if it does not.

And you may also need to consider NaN's or other odd values.

luca.b...@gmail.com

Oct 18, 2018, 10:34:07 AM

> However, the number in this case is 2147483647, which is not exactly
> representable as an IEEE 754 single precision number. However,
> 2147483648 is exactly 2^31, and should therefore be exactly
> representable if FLT_RADIX is a power of 2, in any floating point format
> that conforms to the requirements of the C standard. Therefore, on such
> a system, the only values that are permitted by 6.3.1.4p2 for
> (float)2147483647 are 2147483648.0 or the representable value just below
> that one. An implementation that produces 2147483650 must either have
> FLT_RADIX that is not a power of 2 (pretty uncommon, nowadays) or is
> non-conforming.

Really GREAT explanation!

luca.b...@gmail.com

Oct 18, 2018, 10:46:03 AM

> And you may also need to consider NaN's or other odd values.

+/-INF and NaN

What does the C standard say about comparing (greater than, less than) NaN with normal values, e.g. the expression

NaN > 3.0

Is it specified to be false?

Ben Bacarisse

Oct 18, 2018, 11:01:28 AM

Bart <b...@freeuk.com> writes:

> ... I would expect values around INTMAX to be
> these, separated by 256:

No, by 128, I think.

> 2 147 483 136

2 147 483 264

> 2 147 483 392
>
> with the next value being, I think:
>
> 2 147 483 648

2 147 483 520

comes between those two.

> The above seem to be verified with this program (where 'float' is
> 32-bit IEEE format):

When I run your code it gives the numbers I'd expect, not the ones you
gave. Do you not see the same?

> float x = 2147483648;
>
> printf("%f %08X\n",x, *(int*)&x);
>
> --(*(int*)&x);
> printf("%f\n",x);
> --(*(int*)&x);
> printf("%f\n",x);
>
> The decrements work on the integer representation of a float, and
> change it by one LSB at a time.

This depends on the representation (particularly the endianness), but
there is a portable way to do this in C99:

float x = INT_MAX;
for (int i = 0; i < 5; i++) {
printf("%.0f %a\n", x, x);
x = nextafterf(x, 0);
}

This uses the nextafterf call, which returns the next float in the
direction of the second argument, and the %a format. If you are
investigating the properties of floating point, %a is very helpful. It
gives you a hex representation (to maximum precision) with the sign
shown as usual and the exponent in decimal:

2147483648 0x1p+31
2147483520 0x1.fffffep+30
2147483392 0x1.fffffcp+30
2147483264 0x1.fffffap+30
2147483136 0x1.fffff8p+30

--
Ben.

james...@alumni.caltech.edu

Oct 18, 2018, 11:12:02 AM

The C standard leaves that unspecified - the only relevant thing it says
is "Relational operators may raise the 'invalid' floating-point
exception when argument values are NaNs. For a NaN and a numeric value,
or for two NaNs, just the unordered relationship is true."
(7.12.14p1). If you don't want to have the "invalid" floating-point
exception raised, you can use isgreater(NaN, 3.0), described in 7.12.14.1.

If the __STDC_IEC_559__ macro is predefined by your implementation,
then it must conform to the additional requirements listed in Annex F,
which include "The relational and equality operators provide IEC 60559
comparisons." (F.3p1). IEC 60559 is exactly equivalent to IEEE 754, and
requires that all comparisons involving NaNs are false (including a==a,
when the value of a is a NaN), apart from !=, which is true.
isunordered(a,b) will return true if either a or b has a value that is
a NaN.
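Applied to the clamping function discussed in this thread, a NaN fails both range comparisons and would fall through to the cast, so it needs its own test. A sketch, assuming a 32-bit int and an IEEE 754 float; the function name and the value returned for NaN are arbitrary choices for illustration:

```c
#include <limits.h>
#include <math.h>

/* Clamp a float to int, treating NaN explicitly. Returning 0 for NaN is
   an arbitrary policy chosen for this sketch. */
int clamp_float_to_int(float value) {
    if (isnan(value))
        return 0;
    if (value >= (float)INT_MAX)   /* also catches +infinity */
        return INT_MAX;
    if (value <= (float)INT_MIN)   /* also catches -infinity */
        return INT_MIN;
    return (int)value;
}
```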

Bart

Oct 18, 2018, 12:08:58 PM

On 18/10/2018 16:01, Ben Bacarisse wrote:
> Bart <b...@freeuk.com> writes:
>
>> ... I would expect values around INTMAX to be
>> these, separated by 256:
>
> No, by 128, I think.
>
>> 2 147 483 136
>
> 2 147 483 264
>
>> 2 147 483 392
>>
>> with the next value being, I think:
>>
>> 2 147 483 648
>
> 2 147 483 520
>
> comes between those two.
>
>> The above seem to be verified with this program (where 'float' is
>> 32-bit IEEE format):
>
> When I run your code it gives the numbers I'd expect, not the ones you
> gave. Do you not see the same?

Well, it gave numbers ending with 648, 520, and 392. 392 matched my
list, but the 520 I thought matched something I'd worked out on paper.
But that was erroneous. (I was too hasty because I had to go out.)

>> float x = 2147483648;
>>
>> printf("%f %08X\n",x, *(int*)&x);
>>
>> --(*(int*)&x);
>> printf("%f\n",x);
>> --(*(int*)&x);
>> printf("%f\n",x);
>>
>> The decrements work on the integer representation of a float, and
>> change it by one LSB at a time.
> This depends on the representation (particularly the endianess),

After posting I realised that this decrement would affect the exponent,
which I hadn't intended to touch.

Starting with ....136, and incrementing, it jumps by +128 as you say.
But from ...648 (with a new exponent), it will be +256.

> but
> there is a portable way to do this in C99:
>
> float x = INT_MAX;
> for (int i = 0; i < 5; i++) {
> printf("%.0f %a\n", x, x);
> x = nextafterf(x, 0);
> }

(The msvcrt library doesn't have that, only _nextafter() which works on
doubles. But some other compilers will support it.)

> The nextafterf call, which returns the next float in the direction of
> the second argument, and the %a format. If you are investigating the
> properties of floating point, %a is very helpful. It gives you a hex
> representation (to maximum precision) with the sign shown as usual and
> the exponent in decimal:
>
> 2147483648 0x1p+31
> 2147483520 0x1.fffffep+30
> 2147483392 0x1.fffffcp+30
> 2147483264 0x1.fffffap+30
> 2147483136 0x1.fffff8p+30

Yes, although it's disconcerting that those hex representations change
by 4 rather than by 1. (I understand it's because 6 hex digits, or 24
bits, are used to show 22 bits of the mantissa, and are 'left-justified'.)

I think I'll stick with my %08X...

--
bart

Bart

Oct 18, 2018, 12:19:13 PM

By 2 I mean. Which gives 23 bits of mantissa /plus/ that 1?

I must have misremembered it as 23 in total.

(The 24 bits makes the discrepancy with 31-bits of int32, 7 bits, and
would make the step +128 as you said.)

--
bart

Keith Thompson

Oct 18, 2018, 12:22:10 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> luca.b...@gmail.com writes:
[...]

>> https://blog.forrestthewoods.com/perfect-prevention-of-int-overflows-f000f7e893ee
>>
>> Author says: "Now is when things start to get fun. The largest int is
>> 2147483647 (2^31–1). If you convert that int to a float it rounds up
>> to 2147483650f."
>
> (That's an odd result. I'd expect 2147483648 but then C offers very few
> guarantees about exactly what floating point implementations should do.)

Yes, that's a very odd result.

In hexadecimal, 2147483647 (2**31-1) is 0x7fffffff, and the
supposed result of the conversion, 2147483650f, is 0x80000002.
float is typically 32 bits, which means it has 23 or so mantissa
bits, so 0x80000002 shouldn't be exactly representable.

On my system, converting 2147483647 from int to float rounds to
2147483648 (which does produce an overflow when converted back
to int).

These values of type float are consecutive (on a typical implementation
using 32-bit IEEE representation):
2147483520.0
2147483648.0
2147483904.0
or, in hexadecimal:
0x7fffff80
0x80000000
0x80000100
(You can see the interval doubling at a power of two.)

I think it's simply an error in the article.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Ben Bacarisse

Oct 18, 2018, 3:17:30 PM

I wasn't suggesting you don't print the bits if that's what you need to
see. I was telling people about another handy format that is often
overlooked.

With Intel FP, the layout means that when you decrement (or increment) a
"fake" int, you can, as you discovered, alter the exponent and get the
correct result. The %08X output hides that to some extent, but the %a
format makes it plain. Sometimes it's helpful to have more information.
I find it particularly handy to have both when troubleshooting
calculations with denormalized floats.

--
Ben.

Ben Bacarisse

Oct 18, 2018, 3:38:52 PM

I don't really know what you have sorted out here, but IEEE floats have
a sign bit, 8 exponent bits and a 23-bit mantissa. There is an implicit
leading 1, so the format has 24-bit precision.

--
Ben.

Jorgen Grahn

Oct 18, 2018, 4:09:58 PM

On Thu, 2018-10-18, luca.b...@gmail.com wrote:
> On Thursday, 18 October 2018 at 15:16:13 UTC+2, Anton Shepelev wrote:
>> Luca:
>>
>> >Hi, I'd like to know if there is a safe way to cast a float
>> >to int
>> >
>> >float value; int n; ... n = int(value);
>> >
>> >But proper code should first do a bound check.
>> >
>> >int truncate_float_to_int(float value) {
>> > if (value >= INT_MAX)
>> > return INT_MAX;
>> > if (value <= INT_MIN)
>> > return INT_MIN;
>> > return (int) value;
>> >}
>>
>> Safe in what sense? That function will silently ignore an
>> out-of-range value. What about:

...

> Well, yeah that's much better style!
> For my use case it's ok to silently clamp to int representable values.
> Thanks!

It would have been better if you had explained what your use case is.

It's often hard to find the solution to a very generic problem, but
easy in a specific instance. I suspect your problem is like that.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Bart

Oct 18, 2018, 4:15:29 PM

So that's an extra 1 bit of precision, which means int values can be
stepped more finely than I'd thought.

(So +128 for my example rather than +256, which is where you came in.
The step depends on the exponent of the float and range of the int
values, so doesn't directly follow from those 7 extra bits of precision.)