
In Numerical Computing with MATLAB it is stated that (under the IEEE standard): "the smallest positive normalized floating-point number has f = 0 and e = −1022. The largest floating-point number has f a little less than 1 and e = 1023."

We have realmin = $2^{-1022}$ and realmax = $(2-eps)*2^{1023}$

where eps = $2^{-52}$

My question is: why does the computation of realmax involve an extra factor of $(2-eps)$ when that of realmin does not? Realmax ends up being approximately equal to $2^{1024}$, which is larger than $2^{1023}$ and so seems to violate the maximum number of bits allotted to the exponent (11 bits in IEEE).

You can see that realmin is $2^{-1022}$ and no lower value is allowed. Why is this not the case for realmax?

bluesky11
  • This doesn't seem to be a question about maths. – almagest Jan 17 '20 at 08:28
  • They cannot represent 2^1024 with 1023 bits, so they are trying to get a slightly smaller number. More often than not MATLAB is used for large calculations, so the extra room by doing this is probably helpful – Dhanvi Sreenivasan Jan 17 '20 at 08:32
  • Presumably Matlab uses something like floating point arithmetic. I would guess that the lowest possible (non-zero) value for the significand is zero, and the highest possible value is 2 - eps. – Ben Grossmann Jan 17 '20 at 08:36
  • @almagest: It's definitely a question relevant for numerical analysis. Whether you count that as math or computer science is perhaps a matter of taste. At my university the numerical analysts are in the math department, at least. – Hans Lundmark Jan 17 '20 at 09:22

1 Answer


The 64-bit double-precision floating-point format has 1 sign bit, 11 bits for the exponent, and 52 bits for the fractional part of the significand.

The exponent 11111111111 (eleven ones, in binary) is used for special purposes (Inf and NaN), so the largest exponent available for numbers is 11111111110, which is 2046 in decimal. There's an offset of 1023 that one must subtract, so it really represents the exponent $2046-1023=1023$.

And the largest fraction is 1111111111111111111111111111111111111111111111111111 (fifty-two ones), which has an implicit “1.” in front, so that it means the binary number 1.1111111111111111111111111111111111111111111111111111, or $2-2^{-52}=2-\varepsilon$ in other words (not quite “two”, just as close as you can get).

So $$ \mathrm{realmax} = (2-\varepsilon) \times 2^{1023} . $$
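As a quick check of this construction, here is a minimal Python sketch (Python is used only for illustration, and the helper name `bits_to_double` is hypothetical). It assumes that the platform's `float` is an IEEE 754 double, which is the case for CPython on common hardware. It packs the three bit fields described above into a 64-bit word and compares the result with the formula.

```python
import struct
import sys


def bits_to_double(sign, exponent, fraction):
    """Pack sign (1 bit), biased exponent (11 bits) and fraction (52 bits) into a float."""
    word = (sign << 63) | (exponent << 52) | fraction
    return struct.unpack(">d", struct.pack(">Q", word))[0]


# Largest finite value: exponent field 11111111110 (2046), fraction of fifty-two ones.
largest = bits_to_double(0, 0b11111111110, (1 << 52) - 1)

eps = 2.0 ** -52
formula = (2.0 - eps) * 2.0 ** 1023  # both factors and the product are exactly representable

print(largest == formula == sys.float_info.max)

# The all-ones exponent field with a zero fraction is the special value infinity.
print(bits_to_double(0, 0b11111111111, 0))
```

Written out, $(2-\varepsilon)\times 2^{1023} = 2^{1024} - 2^{971}$, so the value stays strictly below $2^{1024}$ while the stored exponent never exceeds 1023.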

Similarly, the exponent 00000000000 is used for special purposes (zero & subnormal numbers), so the smallest exponent available for normalized numbers is 00000000001, which after subtracting the offset becomes $1-1023=-1022$.

And the smallest fraction is 0000000000000000000000000000000000000000000000000000, which corresponds to the number 1.0000000000000000000000000000000000000000000000000000 (so just “one”, no $\varepsilon$ here).

Therefore $$ \mathrm{realmin} = 1 \times 2^{-1022} . $$
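The same reasoning can be checked in the other direction, by splitting a value into its bit fields. This is a small sketch under the same assumption (Python `float` is an IEEE 754 double); the helper name `double_fields` is made up for this example.

```python
import struct
import sys


def double_fields(x):
    """Return (sign, biased exponent, fraction) of a double as integers."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    return word >> 63, (word >> 52) & 0x7FF, word & ((1 << 52) - 1)


realmin = 2.0 ** -1022
sign, biased_exp, fraction = double_fields(realmin)

print(sign, biased_exp, fraction)      # fields of the smallest normalized double
print(biased_exp - 1023)               # exponent after subtracting the offset
print(realmin == sys.float_info.min)

# Halving realmin leaves the normalized range: the exponent field drops to
# 00000000000 and the value is stored as a subnormal number.
print(double_fields(realmin / 2))
```

The last line illustrates why nothing smaller than $2^{-1022}$ counts as normalized: smaller positive values still exist, but they use the reserved all-zeros exponent and lose the implicit leading “1.”.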

Hans Lundmark
  • +1 Ok, you persuaded me! – almagest Jan 17 '20 at 09:25
  • @Hans Lundmark Thank you, this clears up a lot of the confusion. In some references on floating-point arithmetic, the fractional part of the significand is taken to lie in [0.5, 1). How does that apply in the context of realmin and realmax? – bluesky11 Jan 17 '20 at 18:39
  • From the answer of this post: https://math.stackexchange.com/questions/184630/what-does-mantissa-mean-here I would have assumed the mantissa would need to be between [0,1] not [0.5,1). – bluesky11 Jan 17 '20 at 18:40
  • It's just because they use an implied “0.1” in front of the fraction instead of “1.”, so that the whole mantissa is just half as large, and then they shift the offset by 1 (subtract 1022 instead of 1023) so that $2^{\text{exponent}}$ becomes twice as big instead, to compensate. – Hans Lundmark Jan 17 '20 at 21:34
  • @Hans Lundmark But then how do you represent a number like 1.232, considering the mantissa can't be below 0.5? – bluesky11 Jan 17 '20 at 21:56
  • As I said, make the mantissa half as large (instead of $m \in [1,2)$, use $m'=m/2 \in [\tfrac12,1)$), and increase the exponent by one (use $e'=e+1$ instead of $e$); then $x = m \times 2^e = m' \times 2^{e'}$. So $x=1.232 = 1.232 \times 2^0$ becomes $x=0.616 \times 2^1$ instead. – Hans Lundmark Jan 18 '20 at 09:55
  • Genius.. thank you – bluesky11 Jan 20 '20 at 03:44
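
The two conventions discussed in these comments can be compared directly. The sketch below uses Python's standard `math.frexp`, which returns a mantissa in $[\tfrac12,1)$ together with the matching exponent, and then converts to the $[1,2)$ convention used in the answer. The variable names are illustrative only, and Python's `float` is again assumed to be an IEEE 754 double.

```python
import math
import sys

x = 1.232

# Convention with mantissa in [0.5, 1): x == m_half * 2**e_half
m_half, e_half = math.frexp(x)

# Convention with mantissa in [1, 2): double the mantissa, lower the exponent by one.
m_one, e_one = 2.0 * m_half, e_half - 1

print(m_half, e_half)
print(m_one, e_one)

# The same rescaling applied to the two extreme normalized values.
print(math.frexp(sys.float_info.max))  # mantissa just below 1, exponent 1024
print(math.frexp(sys.float_info.min))  # mantissa 0.5, exponent -1021
```

In the $[\tfrac12,1)$ convention, realmax is $(1-\varepsilon/2)\times 2^{1024}$ and realmin is $\tfrac12 \times 2^{-1021}$, which are the same two numbers as before.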