Understanding the explanation of a Bayesian prior for the weight of a linear regression model in Ian Goodfellow's Deep Learning

Question

In Ian Goodfellow's Deep Learning textbook, there is a description of using a Gaussian prior for the weight vector $w$ of a linear regression model.

$$p(w) = N(w; \mu_0, \Lambda_0) \propto \exp\left(-\frac{1}{2}(w-\mu_0)^T \Lambda_0^{-1}(w-\mu_0)\right)$$ where $\mu_0$ and $\Lambda_0$ are the prior distribution mean vector and covariance matrix respectively.

The authors write (Chapter 5.6.0, page 134)

With the prior model thus specified, we can now proceed in determining the posterior distribution over the model parameters $$p(w|X,y) \propto p(y|X, w) p(w) $$ $$ \propto \exp\left(-\frac{1}{2}(y-Xw)^T (y-Xw)\right) \exp\left(-\frac{1}{2}(w-\mu_0)^T\Lambda_0^{-1}(w-\mu_0)\right)$$ $$ \propto \exp\left(-\frac{1}{2}\left(-2y^TXw + w^TX^TXw + w^T \Lambda_0^{-1}w - 2\mu_0^T \Lambda_{0}^{-1} w\right)\right) $$

** We now define $\Lambda_m = (X^TX+\Lambda_0^{-1})^{-1}$ and $\mu_m= \Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0)$. Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution: $$p(w|X, y) \propto \exp\left(-\frac{1}{2} (w-\mu_m)^T \Lambda_{m}^{-1} (w-\mu_m) + \frac{1}{2} \mu_m^T\Lambda_{m}^{-1} \mu_m\right)$$ $$ \propto \exp\left(-\frac{1}{2} (w-\mu_m)^T \Lambda_m^{-1} (w-\mu_m)\right)$$

Starting from (**), I'm a bit unsure how the second-to-last equation is obtained. The prior covariance is assumed diagonal.

Expanding the exponentiated term I get $$w^T \Lambda_m^{-1}w - w^T \Lambda_{m}^{-1} \mu_m - \mu_{m}^T \Lambda_{m}^{-1} w + \mu_m^T \Lambda_{m}^{-1} \mu_m + \frac{1}{2} \mu_m^T \Lambda_{m}^{-1} \mu_m$$

Expanding further (sorry for the ugliness, I just don't see a cleaner way to work through this):

$$w^T \Lambda_m^{-1}w - w^T \Lambda_{m}^{-1} \Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0) - (\Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0))^T \Lambda_{m}^{-1} w $$ $$+ (\Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0))^T \Lambda_{m}^{-1} (\Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0)) $$ $$ + \frac{1}{2} (\Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0))^T \Lambda_{m}^{-1} (\Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0))$$

From here I'm not sure how to simplify. Any help appreciated.

score 1 · Accepted Answer · answered Oct 15 '22 at 17:18

If you stare very hard at the last line you can collect the right terms together, but the easiest way to approach this kind of problem is to identify the terms that need to be substituted out to obtain the final equation.

Start with:

$\begin{align} ... = -\frac{1}{2} \bigg(-2y^{T}Xw + w^{T}X^{T}Xw + w^{T}\Lambda_{0}^{-1}w - 2\mu_0^{T}\Lambda_0^{-1}w\bigg). \end{align}$

Then use $\Lambda_m = (X^TX+\Lambda_0^{-1})^{-1} \implies X^{T}X = \Lambda_m^{-1} - \Lambda_{0}^{-1}$ to remove some of the $X$s:

$\begin{align} ... &=-\frac{1}{2}\bigg( -2y^{T}Xw + w^{T}(\Lambda_m^{-1} - \Lambda_{0}^{-1})w + w^{T}\Lambda_{0}^{-1}w - 2\mu_0^{T}\Lambda_0^{-1}w \bigg) \\ &=-\frac{1}{2}\bigg( -2y^{T}Xw + w^{T}\Lambda_m^{-1}w - 2\mu_0^{T}\Lambda_0^{-1}w \bigg). \end{align}$

The next thing to eliminate is $y^{T}X$ using $\mu_m= \Lambda_m(X^Ty+ \Lambda_{0}^{-1}\mu_0) \implies X^{T}y = \Lambda_{m}^{-1}\mu_{m} - \Lambda_{0}^{-1}\mu_{0}$.

When transposing both sides, we should be aware that $\Lambda_{0}^{-1}$ and $\Lambda_{m}^{-1}$ are symmetric ($\Lambda_0$ is a covariance matrix, which is symmetric, and sums and inverses of symmetric matrices are symmetric). So $y^{T}X = \mu_{m}^{T}\Lambda_{m}^{-1} - \mu_{0}^{T}\Lambda_{0}^{-1}$.

Substituting gives:

$\begin{align*} ...=-\frac{1}{2} \bigg( -2\mu_{m}^{T}\Lambda_{m}^{-1}w + w^{T}\Lambda_{m}^{-1}w\bigg) \end{align*}$.
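The substitutions above can also be sanity-checked numerically. The following is only a rough sketch in NumPy, not anything from the book: the sizes, the random data, and the function names (`exponent_before`, `exponent_after`, `exponent_book`) are made up for illustration, and the prior covariance is taken to be diagonal as in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3  # hypothetical number of examples and features

# Random stand-ins for the design matrix, targets, and prior parameters.
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
mu_0 = rng.normal(size=n)
Lambda_0 = np.diag(rng.uniform(0.5, 2.0, size=n))  # assumed diagonal prior covariance

# Definitions from the book: Lambda_m = (X^T X + Lambda_0^{-1})^{-1}, mu_m = Lambda_m (X^T y + Lambda_0^{-1} mu_0).
Lambda_0_inv = np.linalg.inv(Lambda_0)
Lambda_m_inv = X.T @ X + Lambda_0_inv
Lambda_m = np.linalg.inv(Lambda_m_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda_0_inv @ mu_0)

def exponent_before(w):
    # Exponent before any substitution (the starting line of this answer).
    return -0.5 * (-2 * y @ X @ w + w @ X.T @ X @ w
                   + w @ Lambda_0_inv @ w - 2 * mu_0 @ Lambda_0_inv @ w)

def exponent_after(w):
    # Exponent after eliminating X^T X and y^T X.
    return -0.5 * (-2 * mu_m @ Lambda_m_inv @ w + w @ Lambda_m_inv @ w)

def exponent_book(w):
    # The book's second-to-last expression, including the +1/2 term.
    d = w - mu_m
    return -0.5 * d @ Lambda_m_inv @ d + 0.5 * mu_m @ Lambda_m_inv @ mu_m

# Check the identity X^T y = Lambda_m^{-1} mu_m - Lambda_0^{-1} mu_0 and the symmetry of Lambda_m^{-1}.
identity_ok = np.allclose(X.T @ y, Lambda_m_inv @ mu_m - Lambda_0_inv @ mu_0)
symmetric_ok = np.allclose(Lambda_m_inv, Lambda_m_inv.T)

# Compare the three forms of the exponent at a few random weight vectors.
ws = rng.normal(size=(5, n))
forms_ok = all(
    np.isclose(exponent_before(w), exponent_after(w))
    and np.isclose(exponent_after(w), exponent_book(w))
    for w in ws
)
print(identity_ok, symmetric_ok, forms_ok)
```

If the algebra is right, all three flags should come out true for any choice of data, since the three exponents are equal as functions of $w$ and not merely proportional.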

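What's left is to expand the book's next line, $-\frac{1}{2}(w-\mu_m)^{T}\Lambda_m^{-1}(w-\mu_m) + \frac{1}{2}\mu_m^{T}\Lambda_m^{-1}\mu_m$, and verify that it matches the expression just derived; you can take it from here.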
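For completeness, here is a sketch of that remaining expansion. It uses only the symmetry of $\Lambda_m^{-1}$ noted earlier, which lets the two cross terms merge because $w^{T}\Lambda_m^{-1}\mu_m$ is a scalar equal to $\mu_m^{T}\Lambda_m^{-1}w$:

$\begin{align} -\frac{1}{2}(w-\mu_m)^{T}\Lambda_m^{-1}(w-\mu_m) + \frac{1}{2}\mu_m^{T}\Lambda_m^{-1}\mu_m &= -\frac{1}{2}\bigg( w^{T}\Lambda_m^{-1}w - 2\mu_m^{T}\Lambda_m^{-1}w + \mu_m^{T}\Lambda_m^{-1}\mu_m \bigg) + \frac{1}{2}\mu_m^{T}\Lambda_m^{-1}\mu_m \\ &= -\frac{1}{2}\bigg( -2\mu_m^{T}\Lambda_m^{-1}w + w^{T}\Lambda_m^{-1}w \bigg). \end{align}$

Note that the $+\frac{1}{2}\mu_m^{T}\Lambda_m^{-1}\mu_m$ term sits outside the factor of $-\frac{1}{2}$, so it cancels the $\mu_m^{T}\Lambda_m^{-1}\mu_m$ term produced by the quadratic form. Because that term does not depend on $w$, it can be absorbed into the proportionality constant, which gives the book's final Gaussian form.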

Great insights, thanks for your help. – IntegrateThis, Oct 16 '22 at 07:01
