I have a set of coordinates $\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}$, where for every $i<n$, $$ x_i \ll x_{i+1}\quad\text{and}\quad y_i\ll y_{i+1}. $$ I know that the data points have a linear relationship $y=\alpha+\beta x$.

However, when I use the simple-regression slope formula $\beta=\frac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^n(x_i-\overline{x})^2}$, the terms involving the largest values dominate both sums, so the formula effectively reduces to $\beta\approx\frac{y_n}{x_n}$, which ignores every data point except the largest one. The regression line I need must take the other data points (and the errors they contribute) into account.
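
To illustrate the effect numerically, here is a minimal sketch. The sample data is made up for illustration (each $x_i$ is 100 times the previous one) and is not the actual data set:

```python
import numpy as np

# Hypothetical data: every x_i is 100 times larger than the previous one.
x = np.array([1.0, 1e2, 1e4, 1e6, 1e8])
y = np.array([3.0, 150.0, 3e4, 1e6, 2.5e8])

def ols_slope(x, y):
    """Slope from the simple-regression formula quoted above."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sum(dx ** 2)

slope = ols_slope(x, y)
ratio = y[-1] / x[-1]  # slope implied by the largest point alone

print(slope, ratio)
```

For this data the two printed values should agree closely, even though the smaller points imply quite different ratios $y_i/x_i$.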

One potential solution is to measure the errors in such a way that the larger they become, the more slowly their impact grows. In other words, use the error equation $$ \ln\hat\varepsilon_i=y_i-\alpha-\beta x_i $$ instead of the original equation $\hat\varepsilon_i=y_i-\alpha-\beta x_i$.

In the end, this boils down to finding the ordered pair $(\hat\alpha,\hat\beta)$ that solves $$ \min_{\alpha,\beta} Q(\alpha,\beta)\quad\text{where}\quad Q(\alpha,\beta)=\sum_{i=1}^n e^{2(y_i-\alpha-\beta x_i)}. $$ So which pair $(\hat\alpha,\hat\beta)$ solves this problem?
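
A quick check of how this objective behaves in $\alpha$, using only the definition of $Q$ above: $$ \frac{\partial Q}{\partial\alpha}=-2\sum_{i=1}^n e^{2(y_i-\alpha-\beta x_i)}=-2\,Q(\alpha,\beta)<0. $$ So for any fixed $\beta$, $Q$ is strictly decreasing in $\alpha$ and tends to $0$ as $\alpha\to\infty$. The infimum $0$ is never attained, which means no finite minimizer exists for this choice.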


EDIT: As @user121049 mentioned, the function I applied to the error drives $\alpha$ to $\infty$. Therefore, we need a function to apply to the error that still produces the behavior described above. The function $f$ must satisfy:

  1. the range of $f$ must be $\Bbb{R}$ (so the error can take any value)

  2. $\lim_{x\to0^\pm}f(x)$ cannot be $\pm\infty$ (so neither $\alpha$ nor $\beta$ has a reason to diverge to $\pm\infty$)

  3. $f=o(x)$ as $x\to\pm\infty$ (so the error's impact grows more slowly the larger it is)

  4. $f$ must be an increasing function (so a larger error still counts for more)

A potential choice for $f$ is $$ f(x)=\begin{cases}-\sqrt{-x}, & x<0\\ \sqrt{x}, & x\ge0.\end{cases} $$ This time, the problem to solve is $$ \min_{\alpha,\beta} Q(\alpha,\beta)\quad\text{where}\quad Q(\alpha,\beta)=\sum_{i=1}^n (y_i-\alpha-\beta x_i)^4. $$ In this scenario, which pair $(\hat\alpha,\hat\beta)$ is the solution?
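
I am not aware of a closed form for this quartic objective, but since $Q$ is convex it can be minimized numerically. Below is a minimal sketch, assuming Newton's method with step halving, started from the ordinary least-squares fit. The function names (`quartic_loss`, `quartic_grad`, `quartic_fit`) and the sample data are hypothetical, and the absolute gradient tolerance would need rescaling for data spanning many orders of magnitude:

```python
import numpy as np

def quartic_loss(theta, A, y):
    """Q(alpha, beta) = sum of fourth powers of the residuals."""
    return np.sum((y - A @ theta) ** 4)

def quartic_grad(theta, A, y):
    """Gradient of Q with respect to (alpha, beta)."""
    r = y - A @ theta
    return -4.0 * A.T @ r ** 3

def quartic_fit(x, y, iters=100, tol=1e-12):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
    theta = np.linalg.lstsq(A, y, rcond=None)[0]     # start at the OLS fit
    for _ in range(iters):
        g = quartic_grad(theta, A, y)
        if np.linalg.norm(g) <= tol:                 # absolute tolerance (assumption)
            break
        r = y - A @ theta
        H = 12.0 * (A * (r ** 2)[:, None]).T @ A     # Hessian of Q
        step = np.linalg.lstsq(H, -g, rcond=None)[0] # Newton direction
        t = 1.0
        # Halve the step until the objective no longer increases.
        while quartic_loss(theta + t * step, A, y) > quartic_loss(theta, A, y) and t > 1e-8:
            t *= 0.5
        theta = theta + t * step
    return theta  # array [alpha_hat, beta_hat]

# Hypothetical small-scale data, for illustration only.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 0.9, 2.3, 2.8, 4.2])
alpha_hat, beta_hat = quartic_fit(xs, ys)
print(alpha_hat, beta_hat)
```

The returned pair can be compared against the ordinary least-squares fit on the same data to see how the fourth power shifts the line.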

NOTE: Other candidate functions are also welcome.

NODO55
  • You can model the variance as $\sigma^2 \propto x_i$ or some other function of $x_i$ if that looks more appropriate. You can do this using maximum likelihood or alternatively read up on weighted least squares. – user121049 Jan 14 '18 at 13:43
  • @user121049 I tried that strategy out, and it does bring a more interesting regression. However, I would also like to see how the alternate method I proposed above affects the regression. – NODO55 Jan 15 '18 at 08:20
  • Your suggestion favours negative errors. The minimum would be with $\alpha=\infty$. – user121049 Jan 15 '18 at 09:01
  • If you do the usual linear regression, is there evidence that the error changes with $x$? Maybe read this article: https://en.wikipedia.org/wiki/Heteroscedasticity – user121049 Jan 16 '18 at 08:44
  • Doesn't your suggested fourth-power function make things worse, as the solution will have terms like $\sum x_i^4$, which are even more dominated by the large $x_i$? – user121049 Jan 16 '18 at 08:48
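
The weighted least squares suggestion from the first comment can be sketched as follows, assuming the variance model $\sigma_i^2\propto x_i$ (so each point gets weight $1/x_i$) and strictly positive $x_i$. The function name `wls_fit` and the sample data are hypothetical:

```python
import numpy as np

def wls_fit(x, y, weights):
    """Minimize sum_i w_i * (y_i - alpha - beta * x_i)**2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    A = np.column_stack([np.ones_like(x), x])
    # Scaling each row by sqrt(w_i) turns the weighted problem into ordinary least squares.
    theta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return theta[0], theta[1]

# Hypothetical data with x_i << x_{i+1}; weights 1/x_i follow the assumed variance model.
x = np.array([1.0, 1e2, 1e4, 1e6, 1e8])
y = np.array([3.0, 150.0, 3e4, 1e6, 2.5e8])
alpha_w, beta_w = wls_fit(x, y, 1.0 / x)
print(alpha_w, beta_w)
```

Down-weighting the large points lets the smaller points contribute to the fit. A different variance model only changes the `weights` argument.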
