Machine learning Linear regression cost function

Question

I am doing a project in deep learning and have been following Andrew's machine learning course on YouTube. I am having difficulty understanding how the cost function works. Given the equation below:

$$J(\theta) = \min_\theta \frac{1}{2}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$$

where $m$ is the number of training examples, let's say 20.

This cost function calculates the prediction error due to the parameters in the hypothesis

$$h_\theta(x) = \theta_0 + \theta_1 x_1$$

where $x_1$ is, let's say, the number of bedrooms in a house, and we want to find the cost of the house. My first question is: why is $x_0 = 1$ here?

Secondly, what is the initial value of $\theta$ here? Is it random at first, just like in gradient descent?

What I understand is this: suppose $h_\theta(x)$ predicts the cost of a house. It does so by trying different values of $\theta$ (keeping the input $x$ the same) until, when the hypothesis value is put into the cost function, the cost is at its minimum. The cost function takes $h_\theta(x)$, subtracts the actual $y^{(i)}$ from it, and sums up the differences over all 20 training examples, meaning that it calculates the difference for every training example.

So $y^{(i)}$ here is the training set (all the house cost values), and $y^{(i)}$ will be changed and different values used until all 20 training examples are checked, or until the minimum value of the cost function is reached?

In short, what I am really confused about is this: do we calculate the cost by comparing the hypothesis with every training example? And do we calculate the hypothesis by changing the values of $\theta$, and then use it in the cost to get the minimum prediction error?

Please let me know if my concepts are correct, and correct me if I am wrong.

I was stuck on the same problem. If you still have trouble understanding, see this link; it's an article about the cost function on Medium. — NILESH BHOSALE, Jul 26 '18 at 14:48

score 8 · Answer 1 · answered Nov 18 '14 at 21:13

Your first question asks why $x_0$ should be $1$:

Let's look at the hypothesis, $h_\theta(x) = \theta_0 +\theta_1x_1$, where $x_1$ is the number of bedrooms in the house. We can clearly see here that the coefficient of $x_1$ is $\theta_1$. Now, what we could do, is rewrite the hypothesis like this: $h_\theta(x) = \theta_0x_0 +\theta_1x_1$, where $x_0 = 1$, and really, we wouldn't have changed a thing, since we're just multiplying $\theta_0$ by one.

OK, but why do this? Well --- it's just a trick to make our notation simpler. If we define the vector $\mathbf{x} = \begin{pmatrix}x_0\\x_1\end{pmatrix}$ and $\mathbf{\theta} = \begin{pmatrix}\theta_0\\\theta_1\end{pmatrix}$, then we can concisely write the hypothesis as: $h_\theta(x) = \mathbf{\theta^T}\mathbf{x}$. In the case of a 2 dimensional feature vector, this may not seem like a big deal, but once we start moving to higher and higher dimensions, this notation is a lot easier to deal with.
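To make the $x_0 = 1$ trick concrete, here is a minimal sketch in plain Python. The helper names `add_bias` and `hypothesis`, and the values of $\theta$, are made up for illustration; they are not from the course.

```python
def add_bias(features):
    """Prepend x0 = 1 so that theta[0] plays the role of the intercept."""
    return [1.0] + list(features)


def hypothesis(theta, x):
    """Compute theta^T x as an ordinary dot product."""
    return sum(t * xi for t, xi in zip(theta, x))


theta = [50.0, 25.0]   # hypothetical theta_0 and theta_1
x = add_bias([3.0])    # a house with 3 bedrooms becomes [1.0, 3.0]

explicit = theta[0] + theta[1] * 3.0   # theta_0 + theta_1 * x_1
vectorized = hypothesis(theta, x)      # theta^T x with x_0 = 1

print(explicit, vectorized)  # prints 125.0 125.0
```

The same `hypothesis` function keeps working unchanged if more features are added, which is the whole point of the notation.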

Your second question asks about initial values of $\theta$:

I want to clarify the reasoning behind the cost function and how gradient descent is used; hopefully this will clear up any confusion you have. Look at it like this: the ultimate goal here is to estimate the parameters $\theta_0$ and $\theta_1$. What we need is some way to measure "how good" these estimates are. One way to do this is to measure the difference between the right answers, the $y^{(i)}$s, and our estimated values, the $h_\theta(x^{(i)})$s; good estimates make that difference small.

We do this by defining the cost function: $J(\theta) = \frac{1}{2}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$. We wish to minimize this function $J(\theta)$ since it represents the sum of the squared errors our hypothesis has made. In other words, we wish to find values for our $\theta$s which minimize $J(\theta)$. The key thing to realize is that we want to find our $\theta$s.
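As a rough illustration of the cost function defined above, the sketch below evaluates $J(\theta)$ on a tiny made-up training set (three houses instead of 20). Note that the training data `X` and `y` stay fixed; only `theta` changes between the two calls. The names `hypothesis` and `cost` and all the numbers are assumptions for the example.

```python
def hypothesis(theta, x):
    """Compute theta^T x, where x already includes x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/2 * sum over all examples of (h_theta(x_i) - y_i)^2."""
    return 0.5 * sum((hypothesis(theta, x_i) - y_i) ** 2
                     for x_i, y_i in zip(X, y))


# Made-up data: each row is [x0, x1] = [1, number of bedrooms].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [75.0, 100.0, 125.0]   # generated from theta = [50, 25], so a perfect fit exists

print(cost([50.0, 25.0], X, y))  # prints 0.0 (every prediction is exact)
print(cost([0.0, 0.0], X, y))    # prints 15625.0 (a much worse guess)
```

A lower value of `cost` means the corresponding `theta` fits the fixed training set better, which is what "minimizing $J(\theta)$" refers to.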

Gradient descent is simply one way we can do this. I'm not going to go into detail about it, but I will mention that you can start with random $\theta$s. Using gradient descent, we iteratively reach values which minimize $J(\theta)$. There are a bunch of different flavours of gradient descent -- batch, minibatch and stochastic gradient descent, but they essentially do the same thing.
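For the curious, here is a minimal sketch of batch gradient descent starting from random $\theta$s, in plain Python. The learning rate `alpha`, the number of `steps`, the `seed`, and the data are all assumptions chosen so that this small example converges; they are not recommended settings.

```python
import random


def hypothesis(theta, x):
    """Compute theta^T x, where x already includes x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def gradient_descent(X, y, alpha=0.05, steps=5000, seed=0):
    """Batch gradient descent on J(theta) = 1/2 * sum (h_theta(x_i) - y_i)^2."""
    rng = random.Random(seed)
    # Random starting point: one parameter per column of X.
    theta = [rng.uniform(-1.0, 1.0) for _ in X[0]]
    for _ in range(steps):
        # Prediction errors on ALL training examples (this is the "batch" part).
        errors = [hypothesis(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
        # Partial derivative of J with respect to theta_j: sum_i error_i * x_ij.
        grad = [sum(e * x_i[j] for e, x_i in zip(errors, X))
                for j in range(len(theta))]
        # Move every parameter a small step against its gradient.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Made-up data generated from theta = [50, 25].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [75.0, 100.0, 125.0]

theta = gradient_descent(X, y)
print(theta)  # should end up very close to [50, 25]
```

Changing `seed` changes the starting point but, for this convex cost, not the values the iteration settles on.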

I don't want to confuse you, but gradient descent is simply one method we may use in order to find our parameters, $\mathbf{\theta}$. We could also use the normal equations, for example, to do so.
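As a sketch of the normal-equations route, the code below solves $(X^TX)\theta = X^Ty$ directly for the two-parameter case, with no iteration and no initial guess. The function name `normal_equation_2d` is hypothetical, and the closed-form 2x2 solve is only meant for this small example; a general implementation would use a linear algebra library.

```python
def normal_equation_2d(X, y):
    """Solve (X^T X) theta = X^T y for two parameters using Cramer's rule."""
    # Entries of the symmetric 2x2 matrix A = X^T X.
    a00 = sum(x[0] * x[0] for x in X)
    a01 = sum(x[0] * x[1] for x in X)
    a11 = sum(x[1] * x[1] for x in X)
    # Entries of the vector b = X^T y.
    b0 = sum(x[0] * y_i for x, y_i in zip(X, y))
    b1 = sum(x[1] * y_i for x, y_i in zip(X, y))

    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("X^T X is singular; the parameters are not unique")

    theta0 = (a11 * b0 - a01 * b1) / det
    theta1 = (a00 * b1 - a01 * b0) / det
    return [theta0, theta1]


# Made-up data generated from theta = [50, 25]; each row is [x0, x1].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [75.0, 100.0, 125.0]

print(normal_equation_2d(X, y))  # prints [50.0, 25.0]
```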

Excellent explanation. +1 from me. — Biranchi, Feb 26 '17 at 13:19

score 0 · Answer 2 · answered Dec 31 '17 at 17:22 · edited Dec 31 '17 at 17:35 · jee

The cost function $$J(\theta) = \frac{1}{2}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$$ should be changed to $$J(\theta) = \frac{1}{2m}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$$
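To see what the extra factor of $1/m$ does in practice, here is a small sketch comparing the two forms on made-up data. The names `cost_sum` and `cost_mean` and the candidate values of $\theta$ are assumptions for the example. Both forms rank the candidates the same way; they differ only by the constant factor $m$.

```python
def hypothesis(theta, x):
    """Compute theta^T x, where x already includes x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost_sum(theta, X, y):
    """J(theta) with the 1/2 factor: half the sum of squared errors."""
    return 0.5 * sum((hypothesis(theta, x_i) - y_i) ** 2
                     for x_i, y_i in zip(X, y))


def cost_mean(theta, X, y):
    """J(theta) with the 1/(2m) factor: the same sum divided by m."""
    return cost_sum(theta, X, y) / len(y)


# Made-up data generated from theta = [50, 25]; each row is [x0, x1].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [75.0, 100.0, 125.0]

# A few hypothetical candidates for theta, varying only theta_1.
candidates = [[50.0, t1] for t1 in (10.0, 20.0, 25.0, 30.0)]

best_sum = min(candidates, key=lambda th: cost_sum(th, X, y))
best_mean = min(candidates, key=lambda th: cost_mean(th, X, y))

print(best_sum, best_mean)  # both pick [50.0, 25.0]
```

So the choice between $\frac{1}{2}$ and $\frac{1}{2m}$ changes the scale of the reported cost, not which $\theta$ minimizes it.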

Welcome to MSE. Please use MathJax. – José Carlos Santos Dec 31 '17 at 17:28
