Hello, I was introduced to the method for generating a double in the interval [0, 1) from a 64-bit int, outlined here:
double d = (x >> 11) * 0x1.0p-53;
I was wondering if someone could explain the choice of shift and exponent values here. According to the text:
A standard double (64-bit) floating-point number in IEEE floating point format has 52 bits of significand, plus an implicit bit at the left of the significand. Thus, the representation can actually store numbers with 53 significant binary digits.
So I can see that that's where 53 is coming from, and 11 + 53 = 64. But I really don't have a strong understanding of why this is justified. I notice that other shift/exponent pairs also return a double in the same range, but I assume there's some quirk of floating point that means the example above gives the best results in some way?
Any help shedding light on this would be much appreciated. Also, going by the same logic, I assume the following is the "correct" way to generate a float from a 32-bit int? Thanks!
float f = (x >> 8) * 0x1.0p-24f;