Understanding some linear algebra for KL derivation

Question

I'm having trouble understanding certain steps of this proof, even after consulting the Matrix Cookbook.

For two multivariate Gaussians $P_1 = N(\mu_1, \Sigma_1)$ and $P_2 = N(\mu_2, \Sigma_2)$ on $\mathbb{R}^n$:

$KLD(P_1 || P_2) = E_{P_1}[\log P_1 - \log P_2]$

$= \frac{1}{2} E_{P_1}[-\log \det\Sigma_1 - (x - \mu _1)^T\Sigma_{1}^{-1}(x - \mu_1) + \log\det\Sigma_2 + (x - \mu _2)^T\Sigma_{2}^{-1}(x - \mu_2)]$

$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[- (x - \mu _1)^T\Sigma_{1}^{-1}(x - \mu_1) + (x - \mu _2)^T\Sigma_{2}^{-1}(x - \mu_2)]$

$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[-tr (\Sigma_{1}^{-1}(x - \mu_1)(x - \mu _1)^T) + tr(\Sigma_{2}^{-1}(x - \mu_2)(x - \mu _2)^T)]$

$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[-tr (\Sigma_{1}^{-1}\Sigma_{1}) + tr(\Sigma_{2}^{-1}(xx^T - 2x\mu^{T}_{2} + \mu_2\mu_{2}^T))]$

Why does $(x-\mu)(x-\mu) = \Sigma_1$?

$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} - \frac{1}{2}n + \frac{1}{2} tr(\Sigma_{2}^{-1}(\Sigma_1 + \mu_1\mu_{1}^T - 2\mu_1\mu^{T}_{2} + \mu_2\mu_{2}^T))$

What rule gets rid of the EV?

$= \frac{1}{2}(\log \frac{\det\Sigma_2}{\det\Sigma_1} - n + tr(\Sigma_{2}^{-1}\Sigma_1) + tr(\mu_{1}^T\Sigma_{2}^{-1}\mu_1 - 2\mu_{1}^T\Sigma_{2}^{-1}\mu_2 + \mu_{2}^T\Sigma_{2}^{-1}\mu_2))$

$= \frac{1}{2}(\log \frac{\det\Sigma_2}{\det\Sigma_1} - n + tr(\Sigma_{2}^{-1}\Sigma_1) + (\mu_{2}-\mu_1)^T\Sigma_{2}^{-1}(\mu_{2}-\mu_1))$

How do you reduce (what is the rule) that last term?

Thanks

score 2 · Accepted Answer · answered Jun 15 '18 at 10:51

"Why does $(x-\mu)(x-\mu) = \Sigma_1$?"

It doesn't. What you do have is $$ E_{P_1}[(X_1-\mu_1)(X_1-\mu_1)^T] = \Sigma_1, $$ which is the definition of the covariance matrix $\Sigma_1$. Because the trace is linear, it can be exchanged with the expectation, which gives you the step $$ \begin{align} E_{P_1}\left[\operatorname{Tr}\left(\Sigma_{1}^{-1}(X_1-\mu_1)(X_1-\mu_1)^T \right)\right] &= \operatorname{Tr}\left(\Sigma_{1}^{-1}E_{P_1}\left[(X_1-\mu_1)(X_1-\mu_1)^T \right]\right) = \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma_1) = n. \end{align} $$
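As a rough numerical illustration of this step, the sketch below draws samples from a Gaussian and compares sample averages with $\Sigma_1$ and with $n$. The parameter values, the sample size, and the variable names are assumptions made up for the example; they do not come from the derivation.

```python
import numpy as np

# Hypothetical parameters for P_1, chosen only for illustration.
mu1 = np.array([1.0, -2.0, 0.5])
Sigma1 = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.3],
                   [0.0, 0.3, 1.5]])
n = mu1.size

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu1, Sigma1, size=200_000)  # samples from P_1
centered = x - mu1

# Sample average of (x - mu1)(x - mu1)^T, which estimates Sigma1.
emp_cov = centered.T @ centered / len(x)

# Sample average of the quadratic form (x - mu1)^T Sigma1^{-1} (x - mu1).
Sigma1_inv = np.linalg.inv(Sigma1)
quad_mean = np.einsum("ij,jk,ik->i", centered, Sigma1_inv, centered).mean()

print(np.abs(emp_cov - Sigma1).max())                # small sampling error
print(quad_mean, np.trace(Sigma1_inv @ Sigma1), n)   # all close to n
```

A single sample's outer product is not $\Sigma_1$; only the average over many samples approaches it, which is the point of the expectation.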

"What rule gets rid of the EV?"

The rule is simply taking the expected value, using the fact that the expectation and the trace operator are interchangeable. Also recall that $E_{P_1}[X_1] = \mu_1$ and $$ \Sigma_{1} = E_{P_1}[X_1 X_1^T] - \mu_1\mu_1^T. $$ Then $$ \begin{align} E_{P_1}\left[\operatorname{Tr}\left(\Sigma_{2}^{-1}\left(X_1X_1^T-2X_1\mu_2^T+\mu_2\mu_2^T\right)\right) \right] &= \operatorname{Tr}\left(\Sigma_{2}^{-1}E_{P_1}\left[X_1X_1^T - 2X_1\mu_2^T + \mu_2\mu_2^T\right]\right) \\ &=\operatorname{Tr}\left(\Sigma_{2}^{-1}\left[\Sigma_1 + \mu_1\mu_1^T - 2\mu_1\mu_2^T + \mu_2\mu_2^T\right]\right) \end{align} $$ where I have repeatedly used the linearity of expectation.
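If it helps to see the expectation disappear numerically, here is a minimal sketch that compares the trace expression with a Monte Carlo average of $(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)$ over samples from $P_1$. All numbers and names below are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical parameters, for illustration only.
mu1 = np.array([0.5, 1.0])
mu2 = np.array([1.0, -1.0])
Sigma1 = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma2 = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma2_inv = np.linalg.inv(Sigma2)

# Expectation already taken, using E[X] = mu1 and E[X X^T] = Sigma1 + mu1 mu1^T.
M = Sigma1 + np.outer(mu1, mu1) - 2 * np.outer(mu1, mu2) + np.outer(mu2, mu2)
closed_form = np.trace(Sigma2_inv @ M)

# Monte Carlo average of (x - mu2)^T Sigma2^{-1} (x - mu2) under P_1.
rng = np.random.default_rng(1)
x = rng.multivariate_normal(mu1, Sigma1, size=200_000)
diff = x - mu2
monte_carlo = np.einsum("ij,jk,ik->i", diff, Sigma2_inv, diff).mean()

print(closed_form, monte_carlo)  # agree up to sampling error
```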

"How do you reduce the last term?"

The rule used at the end is the trace trick: the trace is invariant under cyclic permutations, and a scalar equals both its own trace and its own transpose. Since $\Sigma_{2}^{-1}$ is symmetric, this allows us to write, for instance, $$ \begin{align*} \operatorname{Tr}(\Sigma_{2}^{-1}\mu_1\mu_2^T) &= \operatorname{Tr}(\mu_2^T\Sigma_{2}^{-1}\mu_1) \\ &= \mu_2^T\Sigma_{2}^{-1}\mu_1\\ &= \mu_1^T\Sigma_{2}^{-1}\mu_2 \\ &=\operatorname{Tr}(\mu_1^T\Sigma_{2}^{-1}\mu_2). \end{align*} $$ Combining this with the quadratic expansion $$ (\mu_2-\mu_1)^T\Sigma_{2}^{-1}(\mu_2-\mu_1) = \mu_2^T\Sigma_2^{-1}\mu_2 - 2\mu_2^T\Sigma_{2}^{-1}\mu_1 + \mu_1^T\Sigma_{2}^{-1}\mu_1 $$ gives the last term.
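Both identities are exact, so they can be checked without sampling. The following is a small sketch with made-up vectors and a made-up symmetric positive definite matrix standing in for $\Sigma_2$.

```python
import numpy as np

# Hypothetical vectors and a symmetric positive definite matrix, for illustration only.
mu1 = np.array([0.5, 1.0])
mu2 = np.array([1.0, -1.0])
Sigma2 = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma2_inv = np.linalg.inv(Sigma2)

# Trace trick: Tr(Sigma2^{-1} mu1 mu2^T) equals the scalar mu2^T Sigma2^{-1} mu1.
lhs = np.trace(Sigma2_inv @ np.outer(mu1, mu2))
rhs = mu2 @ Sigma2_inv @ mu1
rhs_swapped = mu1 @ Sigma2_inv @ mu2  # equal because Sigma2^{-1} is symmetric

# Quadratic expansion of (mu2 - mu1)^T Sigma2^{-1} (mu2 - mu1).
d = mu2 - mu1
quadratic = d @ Sigma2_inv @ d
expanded = mu2 @ Sigma2_inv @ mu2 - 2 * rhs + mu1 @ Sigma2_inv @ mu1

print(lhs, rhs, rhs_swapped)  # three equal numbers
print(quadratic, expanded)    # two equal numbers
```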

That should be all you need to follow the steps involved.
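To tie the steps together, here is a minimal sketch of the final closed form as a function, checked against the familiar one-dimensional formula $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$. The function name `kl_gaussians` and the test values are my own choices for illustration, not part of the proof.

```python
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """KL(P_1 || P_2) for P_i = N(mu_i, Sigma_i), via the closed form derived above."""
    mu1 = np.atleast_1d(mu1).astype(float)
    mu2 = np.atleast_1d(mu2).astype(float)
    Sigma1 = np.atleast_2d(Sigma1).astype(float)
    Sigma2 = np.atleast_2d(Sigma2).astype(float)
    n = mu1.size
    d = mu2 - mu1
    # slogdet returns (sign, log|det|); the sign is +1 for positive definite input.
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    trace_term = np.trace(np.linalg.solve(Sigma2, Sigma1))  # Tr(Sigma2^{-1} Sigma1)
    quad_term = d @ np.linalg.solve(Sigma2, d)               # d^T Sigma2^{-1} d
    return 0.5 * (logdet2 - logdet1 - n + trace_term + quad_term)

# 1-D sanity check: variances 1 and 4, i.e. sigma_1 = 1 and sigma_2 = 2.
kl_1d = kl_gaussians(0.0, 1.0, 1.0, 4.0)
scalar = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_1d, scalar)  # the two values should coincide
```

Using `np.linalg.solve` and `np.linalg.slogdet` instead of an explicit inverse and determinant is a design choice for numerical stability; the formula itself is unchanged.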
