Floating-point arithmetic error

Question

Suppose you need to generate $n + 1$ equally spaced points on the interval $[a, b]$, with spacing $h = \frac{b-a}{n}$. In floating-point arithmetic, which of the following methods: $x_0=a$, $x_k = x_{k-1}+h$ ($k=1,2,...n$) or $x_k = a+kh$ ($k=0,...,n$) is better?

My attemp: I think the first one is better, because only the multiplication error resulting from $kh$ may be big, and that error will be increased significantly by the addition with $a$ due to the relatively large distance between $a$ and $kh$. On the other hand, the first one only handles error in addition, and since the distance between $x_{k-1}$ and $h$ is not as large as those between $a$ and $kh$, we get the conclusion.

Can anyone help review my argument above to see if it's correct? Many thanks for your help.

score 2 · Answer 1 · answered Jan 28 '15 at 17:34

2

I believe the second is better. You can try it in python:

In [1]: 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1
Out[1]: 0.9999999999999999

In [2]: 10*0.1
Out[2]: 1.0

answered Jan 28 '15 at 17:34

Mark McClure

30,510

I will try it out. Many thanks for your help! – ghjk Jan 28 '15 at 19:28
I tried that one out in MATLAB, but the first one gave me 1.000, not 0.999999:P – ghjk Jan 29 '15 at 15:29
@user177196 That's really a formatting issue. Try 1-(0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1). – Mark McClure Jan 29 '15 at 15:38
1

It works!! Thanks a lot:) – ghjk Jan 29 '15 at 15:45

AlexR · Accepted Answer · 2015-01-30T20:11:38.420

2

Hint
Let's make a really bad FLOP unit, rounding to integers. Try to divide $[0,5]$ into $16$ parts of length $\frac13$.

Method 1 will give you $x_i = 0$ for all $i$ because $0 + \frac13 = 0.\bar3$ is rounded to $0$ in every iteration.

Method 2 will give you $(0,\frac13,\frac23,1,1\frac13,1\frac23,2,\ldots) \to (0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5)$

The same problem occurs on a smaller scale with a good FLOP. In general, it's always worse to use erroneous input ($x_{i+1} = f(x_i)$) for computations than to use perfect information ($x_{i+1} = g(a,h,i)$)

Now in general a floating point unit (FLU) is specified by the tolerances, $\epsilon, \delta$ such that $$|[a+b]-a-b| \le \epsilon \ \forall a,b\in\mathbb R\\ |[a\cdot b] - ab| \le \delta \ \forall a,b\in\mathbb R$$ This allows you to make estimates:

Method 1 $$e_i = |[x_{i-1} \pm e_{i-1} + h] - (x_{i-1} + h)| \le e_{i-1} + \epsilon \Rightarrow e_i \le i \epsilon$$ Where the worst-case happens when $x_i$ and $h$ are chosen such that the FLU makes maximum error, and the $e_{i-1}$-term is the "carried" error. Thus the error increases linearly with the number of knots. $e = \mathcal O(n)$

Method 2 $$e_i = |[x_0 + [k\cdot h]] - x_0 - kh| \le |[x_0 + kh \pm \delta] - x_0 - kh| \le \epsilon + \delta$$ Here the error is constant, even in worst case. $e = \mathcal O(1)$

Thus the error in Method 2 is at most a constant wich is much smaller than linear for growing $n$. Method 2 is thus provably better.

Notation
By $[a+b]$ I refer to the result we get when "asking" our FLU for $a+b$. In an ideal FLU, this only depends on the accurate result $a+b$, so $$[\cdot]:\mathbb R \to \mathbb R$$ can be thought of as a "round to nearest representable number"-function. Then the FLU specification tells us that $$|[x]-x| \le \epsilon$$ or in terms of the supremum norm $$\|[\cdot]-\mathrm{Id}\|_\infty \le \epsilon$$ I allow addition and multiplication to have different accuracy ($\epsilon$ and $\delta$ resp.) but when estimating the errors in worst-case (best-case is always $0$ error^^), you can think of $[\cdot]$ as a function $$[x] = \cases{x+\epsilon\\x-\epsilon}$$ whatever increases the total error. (That is out FLU delivers the greatest error possible within its specification)

Numerical Experiment
The following MATLAB-code produces a plot of the error. You'll find errB = 0 while errA already climbs to $10^{-12}$ for $h = 10^{-5}$.

N = 10.^(1:5);
errA = 0*N;
errB = 0*N;
for i=1:length(N)
    n=N(i);
    h=1/n;
    xA=0;
    xB=0;
    for k=1:n
        xA = xA + h;
    end
    xB = n*h;
    errA(i) = abs(xA-1);
    errB(i) = abs(xB-1);
 end
 loglog(N,errA,'r-',N,errB,'b-');

edited Jan 30 '15 at 20:11

answered Jan 28 '15 at 17:34

AlexR

24,905

1

Very useful for me! Many thanks, AlexR:) – ghjk Jan 28 '15 at 19:28
Btw, can you please give this problem a try? http://math.stackexchange.com/questions/1122922/min-exponent-range-in-normalized-floating-point-system – ghjk Jan 28 '15 at 19:35
I wonder if it's "legit" to require the rounding to 0 between iterations, since the first formula is $x_k=x_{k-1}+h$, not $x_k=[x_{k-1}]+h$? By the way, did you mean $15$ parts of length $\frac{1}{3}$? – ghjk Jan 29 '15 at 15:33
@user177196 We get $16$ knots, $15$ intervals. Yes it is, the FLOP is specified by the error |[a+b] - (a+b)| where [.] denotes the operation of the circuit. – AlexR Jan 29 '15 at 15:35
what is a+b? I used your error formula $|[x_{k+1}]-x_{k+1}|$ but didn't get $0$ in every iteration:P I interpreted $[.]=$ floor function (I don't see what you meant with "operation of the circuit" though) – ghjk Jan 29 '15 at 15:43
So I got the following result: if we use the error formula $[x_{k+1}]-x_{k+1}$ with $a=0, b=1$, $15$ intervals, then the first formula gave an error sequence (-0.0677, -0.1333, -0.2,...,-1) while the second formula gave a smaller error sequence at every iterations (0, -0.0667, -0.1333,....,-0.9333,0). Is this what you really intended to do to show that method 2 works better than method 1? – ghjk Jan 29 '15 at 15:54
Let me expand into the answer once I'm on a computer again. "The circuit" is the imperfect computation unit, while the symbol itself refers to the exact values. – AlexR Jan 29 '15 at 16:39
Please help expand your formula, since I don't quite see how it can be used to demonstrate the difference between the two methods:P – ghjk Jan 30 '15 at 02:29
@user177196 I have written it out now. If you're unfamiliar with the notation, check out Landau symbols. – AlexR Jan 30 '15 at 11:00
Many thanks for your elaboration! But I still don't get how the notation [.] operates and how to do arithmetic with it:P Can you please explain this further? – ghjk Jan 30 '15 at 17:54
@user177196 I hope you understand the notation now from my elaboration. You can think of $[\cdot]$ as a function, maybe call it $\mathrm{eval}$ wich describes the behaviour of your FLU. Ideally $\mathrm{eval}(x) = x$, but the small errors (of order $\epsilon$) are unavoidable because we use a finite representation for any real number (so in fact we are tying ourselves to a small subset of $\mathbb Q$ when working with a computer or calculator. – AlexR Jan 30 '15 at 18:03
I finally see it now!! Great answer btw:) Based on my understanding on the details you provided, if I want to implement the two methods on the computer to show that method 2 works better, I can use the error formula: $error_i=[x_{i+1}]-x_{i+1}$ where [.] denotes the floor function, and $x_{i+1}$ is the value of $x_{i+1}$ in each method. Comparing the two sequences of errors, I got: (-0.0677, -0.1333,...,-0.9333,0) and (0,-0.0667,...,-0.9333,0), and the 2nd sequence is better. Is this a valid approach? – ghjk Jan 30 '15 at 19:22
1

@user177196 To "simulate" the error (why would you want to do that, anyways?!?) on a computer you should use the round function, else your $\epsilon = 1$ wich is even larger. Also note that the error terms should be measured as absolute value. A numerical experiment to actually see the superiority of Method 2 would rather be to divide $[0,1]$ into $N$ parts using the two methods and increasing $N$. As an error measure, record the last knot error, $|x_N - 1|$ and plot it in a loglog scale for $N\to 10^{10}$ in exponential steps or something. – AlexR Jan 30 '15 at 19:29
Many thanks for pointing out that I should have compared the last knot with 1:) That actually makes more sense to me than what I did. And the reason for doing such thing is because the problem asks to use computer to show the difference:P – ghjk Jan 30 '15 at 19:40

Floating-point arithmetic error

2 Answers2