0

I am trying to minimize a function A wrt W, so seeking its gradient

$$ A = \ln \det ( WW^T + \sigma^2I)$$

So according to the chain rule I found

$$ \frac{\partial A}{\partial W} = tr((\frac{\partial g(U)}{\partial U})^T \cdot \frac{\partial U}{\partial W_{ij}}) $$

Where

$$ U = WW^T \sigma^2 I $$ $$ g(U) = \ln \det U $$

I found also that

$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}\partial U) $$

Since $ \partial U $ wrt U should be just a matrix full of ones, call it S,

$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}S) $$

And also

$$ \frac{\partial U}{\partial W_{ij}} = \frac{\partial WW^T + \sigma^2 I}{\partial W_{ij}} = \frac{\partial WW^T}{\partial W_{ij}} $$

Which I found is

$$ \frac{\partial WW^T}{\partial W_{ij}} = WJ^{ji} + J^{ij} W^T $$

So putting it all together

$$ \frac{\partial A}{\partial W} = tr(tr(U^{-1}S)^T \cdot (WJ^{ji} + J^{ij} W^T)) $$

(The transpose can be dropped as our function is scalar.)

However, this result does not seem to agree with a simple numerical derivation. Why is this?

lericson
  • 189

1 Answers1

1

Rewrite the function in terms of the trace function and find its differential $$\eqalign{ A &= \log(\det(WW^T+\sigma^2 I)) \cr &= {\rm tr}(\log(WW^T+\sigma^2 I)) \cr\cr dA &= (WW^T+\sigma^2 I)^{-T}:d(WW^T) \cr &= (WW^T+\sigma^2 I)^{-T}:2\,{\rm sym}(dW\,W^T) \cr &= 2\,{\rm sym}(WW^T+\sigma^2 I)^{-1}:dW\,W^T \cr &= 2\,(WW^T+\sigma^2 I)^{-1}W:dW \cr }$$ Since $dA=(\frac{\partial A}{\partial W}:dW),\,$ the gradient must be $$\eqalign{ \frac{\partial A}{\partial W} &= 2\,(WW^T+\sigma^2 I)^{-1}W \cr }$$ The above derivation employs both the Frobenius (:) product and the sym() function $$\eqalign{ {\rm sym}(M) &= \frac{1}{2}(M+M^T) \cr A:M &= {\rm tr}(A^TM) \cr }$$

john316
  • 1,286
  • Wow, you're great at this! How did you know what to do? Is there a set of rules you recommend? Like a cookbook. – lericson Nov 18 '15 at 12:32
  • I have another problem now, trying to find out $\sum_i^N \frac{\partial y_i^T (WW^T + \sigma^2 I) y_i }{\partial W} $, and I get really odd expressions. What would you do here? – lericson Nov 18 '15 at 13:42
  • 1
    Collect the ${y_i}$ into a single matrix term $$S=\sum_i^N y_i y_i^T$$ The new function can be written as the Frobenius product $$S:(WW^T+\sigma^2 I)$$ The differential of a Frobenius (or Kronecker, Hadamard, etc} product follows the standard rule $$d(S:A)=dS:A+S:dA$$ Assuming that the ${y_i}$ are independent of $W$ the $dS$ term is zero. – john316 Nov 18 '15 at 14:38
  • Alright, and $ S : dA $ should be the Frobenius product of S with the differentiation of $ WW^T + \sigma^2 I $? I think this part is also hard -- I get an expression that gives a matrix for each element. – lericson Nov 18 '15 at 16:04
  • 1
    That term is pretty straightforward. $$\eqalign{dA &= d(WW^T+\sigma^2 I)\cr &= d(WW^T) \cr &= dW,W^T + W,dW^T\cr &= 2,{\rm sym}(dW,W^T)\cr\cr S:dA &= 2,S:{\rm sym}(dW,W^T)\cr &= 2,{\rm sym}(S): dW,W^T\cr &= 2,S:dW,W^T\cr &= 2,SW:dW\cr }$$where I've used the fact that $S$ is symmetric, so $S={\rm sym}(S)$. – john316 Nov 18 '15 at 19:04
  • Fantastic, @john316! I see the Frobenius product is diligently used throughout your derivations. Interesting. One last question -- the first term resulted in a matrix expression of all the derivatives at the same time, while $SW : dW$ seems not to allow for this. Is that right? If so, why? – lericson Nov 19 '15 at 09:17
  • Huh? The gradient of your new function is $\frac{\partial f}{\partial W} = 2,SW,,$ which is a full matrix. – john316 Nov 19 '15 at 19:45
  • Ah, I think I see. So the gradient of both terms is $ \frac{\partial L}{\partial W} = 2({WW}^T + \sigma^2 I)^{-1}W + 2 SW $. You find the second term of the gradient function by rewriting it using the Frobenius product. The result is an expression that is the Frobenius product $M : dW = M$, since $dW$ is all ones. Is that correct? – lericson Nov 20 '15 at 11:35
  • Also I realize that $S = {YY}^T$, is that right? – lericson Nov 20 '15 at 14:44
  • That is, where $Y$ is a matrix whose columns are $y_i$ – lericson Nov 20 '15 at 14:56
  • Yes, you can write $S=YY^T$. But the reasoning behind using the differential to find the gradient is this $$dL = \frac{\partial L}{\partial W}:dW$$ If you have found an expression which only works when $dW=1$, then you cannot identify the gradient from that expression. But if you have an expression holds for arbitrary values of $dW$, then you have found the gradient. – john316 Nov 22 '15 at 02:06