4

The well-known formula of calculating Sum of Squared Error for a cluster is this: SSE formula

where "c" is the mean and "x" is the value of an observation.

But this formula also brings the same result: Alternative SSE formula

where "m" is the number of the observations and "y" takes in every iteration, values of the observations.

For example, if we have {3, 7, 8} , our mean "c" = 6 and:

Using the usual formula: (6-3)² + (6-7)² + (6-8)² = 14

Using the alternative formula: [ 1∕(2*3) ] × [ (3-3)² + (3-7)² + (3-8)² + (7-3)² + (7-7)² + (7-8)² + (8-3)² + (8-7)² + (8-8)²] = 14

Starting from the first formula, I 'm trying to prove the alternative, but I 'm lost. Can someone help me with the proof?

2 Answers2

3

Let $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$ be the sample mean; equivalently, $$n \bar x = \sum_{i=1}^n x_i.$$ Then $$\begin{align*} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 &= \sum_{i=1}^n \sum_{j=1}^n (x_i - \bar x + \bar x - x_j)^2 \\ &= \sum_{i=1}^n \sum_{j=1}^n (x_i - \bar x)^2 + (x_j - \bar x)^2 - 2(x_i - \bar x)(x_j - \bar x) \\ &= \sum_{i=1}^n (x_i - \bar x)^2 \sum_{j=1}^n 1 + \sum_{i=1}^n \sum_{j=1}^n (x_j - \bar x)^2 - 2 \sum_{i=1}^n (x_i - \bar x) \sum_{j=1}^n (x_j - \bar x) \\ &= (\operatorname{SSE})( n ) + \left(\sum_{i=1}^n \operatorname{SSE}\right) - 2 \left( n \bar x - \sum_{i=1}^n \bar x \right) \left( n \bar x - \sum_{j=1}^n \bar x \right) \\ &= 2n \operatorname{SSE}{}- 2 (n \bar x - n \bar x)(n \bar x - n \bar x) \\ &= 2n \operatorname{SSE}{} - 0 \\ &= 2n \operatorname{SSE}. \end{align*}$$

heropup
  • 135,869
  • A few questions: i)in step 3; why does "n" shows up? Why does "-" becomes " + " ? ii)in step 4; why does nx shows up and why is the parentheses squared? I'm sorry for the questions, i am new to the field. – Ian_Lane Dec 27 '19 at 03:58
  • @Ian_Lane I have added steps to clarify. The sign change was an error, but it did not affect the result because that term is equal to zero. – heropup Dec 27 '19 at 05:26
  • to be consistent with the first term, must not the second term be: $\sum_{j=1}^n (x_j-\bar{x})^2\sum_{i=1}^n 1$? – farruhota Dec 27 '19 at 05:50
  • @farruhota it is not necessary: I have evaluated each double sum in order. In the first case, $(x_i - \bar x)^2$ is a constant function of the index $j$, so it can be factored out. In the second case the sum of $(x_j - \bar x)^2$ over $j$ can be evaluated as the sum of squared errors; then the outer sum over $i$ is a constant function of this. – heropup Dec 27 '19 at 06:15
  • Fair enough, +1. – farruhota Dec 27 '19 at 06:37
3

Working with your numerical example: $$14=SSE=(6-3)² + (6-7)² + (6-8)² = \\ \left(\frac{3+7+8}{3}-3\right)^2+\left(\frac{3+7+8}{3}-7\right)^2+\left(\frac{3+7+8}{3}-8\right)^2=\\ \frac{(7-3+8-3)^2}{3^2}+\frac{(3-7+8-7)^2}{3^2}+\frac{(3-8+7-8)^2}{3^2}=\\ \frac{2[(7-3)^2+(8-3)^2+(7-8)^2]+2[(7-3)(8-3)+(3-7)(8-7)+(3-8)(7-8)]}{3^2}=\\ \frac{3[(7-3)^2+(8-3)^2+(7-8)^2]}{3^2}-\frac{[(7-3)+(3-8)+(8-7)]^2}{3^2}=\\ \frac{2[(7-3)^2+(8-3)^2+(7-8)^2]}{2\cdot 3}-0=\\ \frac{\sum_{i}\sum_{j}dist(x_i,y_j)^2}{2\cdot3}.$$

farruhota
  • 31,482