1

I am self-learning basic applied numerical analysis. In the book, Numerical methods in Scentific computing Volume I by Dahlquist and Bjoerck on page 109, the Sterbenz lemma is stated as follows:

(Sterbenz Lemma.) Let the floating point numbers $x$ and $y$ satisfy

$$y/2 \le x \le 2y$$

Then, $fl(x - y) = x - y$, unless $x - y$ underflows.

I am not quite able to comprehend the proof of this really useful and interesting lemma.

Questions.

  1. Is the idea that the exponents of $x$ and $y$ differ by at most $1$, derived from the fact here that $\beta=2$ and $y/2 \le x \le 2y$ (that is are we working in the floating-point binary64 system)?

  2. How do you get the inequality $x/2 \le y \le x < 2$?

Quasar
  • 5,410

1 Answers1

1

[ I dont have the reputation to comment or edit. I think in your numbered question (2) includes a typorgaphical error. ]

Item 1: Excluding underflow.

The exclusion for $\dot{x}-\dot{y}$ underflowing is to deal with incomplete floating point implementations that do not include subnormal numbers. Subnormal numbers are required for the following to hold $$\dot{x} - \dot{y} = 0 \longrightarrow \dot{x} = \dot{y}$$

Item 2: Explaining what is going on.

The problem can be simplified by symmetry:

Using $\mathbb{F}$ to denote a floating point format, then $$\dot{x} \in \mathbb{F} \longleftrightarrow -\dot{x} \in \mathbb{F}$$ so $$\left(\dot{a} - \dot{b}\right) \in \mathbb{F} \longleftrightarrow \left(\dot{b}-\dot{a}\right) \in \mathbb{F}$$ and its thus sufficient to consider the problem on the domain $$0 < \frac{1}{2}\dot{a} \le \dot{b} \le \dot{a}$$

Now consider 3 cases:

  1. When the exponent of $\dot{a}$ is 2 or more larger than the exponent of $\dot{b}$.

    Then the difference is not guaranteed to be exact, but $\dot{b} < \frac{1}{2}\dot{a}$ so this case is excluded by the domain restriction.

  2. When the exponent of $\dot{a}$ is the same as the exponent of $\dot{b}$.

    Then the difference is trivially representable.

  3. When the exponent of $\dot{a}$ is exactly 1 more than the exponent of $\dot{b}$.

    Then the difference is representable if one bit is cancelled, and one bit will be cancelled if the bit pattern has certain properties (think long hand subtraction with explicit borrow ...) and restricting the domain to $0 < \frac{1}{2}\dot{a} \le \dot{b} \le \dot{a}$ is sufficient (but not strictly necessary) to ensure the bit pattern has the necessary properties.

It should be easy to do some examples by hand of the interesting 3rd case to verify that what is required is that the string of leading $1$ bits in $\dot{b}$ is at least as long as the string of leading one bits in $\dot{a}$.