I am trying to minimize a function $A$ with respect to $W$, so I am looking for its gradient:
$$ A = \ln \det ( WW^T + \sigma^2I)$$
So according to the chain rule I found
$$ \frac{\partial A}{\partial W_{ij}} = tr((\frac{\partial g(U)}{\partial U})^T \cdot \frac{\partial U}{\partial W_{ij}}) $$
Where
$$ U = WW^T + \sigma^2 I $$ $$ g(U) = \ln \det U $$
I found also that
$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}\partial U) $$
Since $ \partial U $ with respect to $U$ should just be a matrix full of ones (call it $S$),
$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}S) $$
And also
$$ \frac{\partial U}{\partial W_{ij}} = \frac{\partial (WW^T + \sigma^2 I)}{\partial W_{ij}} = \frac{\partial (WW^T)}{\partial W_{ij}} $$
Which I found is
$$ \frac{\partial WW^T}{\partial W_{ij}} = WJ^{ji} + J^{ij} W^T $$
So putting it all together
$$ \frac{\partial A}{\partial W_{ij}} = tr(tr(U^{-1}S)^T \cdot (WJ^{ji} + J^{ij} W^T)) $$
(The transpose can be dropped as our function is scalar.)
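To make the final expression easier to compare against numbers, it can be written out entrywise. This is only a sketch under one assumption not stated above: $J^{ij}$ is taken to be the single-entry matrix with the same shape as $W$ (a 1 at position $(i,j)$, zeros elsewhere) and $J^{ji}$ its transpose. Since $tr(U^{-1}S)$ is a scalar, call it $c$ (a label introduced here), the expression reduces to

$$ tr(c \cdot (WJ^{ji} + J^{ij} W^T)) = c \, (tr(WJ^{ji}) + tr(J^{ij} W^T)) = 2 c \, W_{ij}, \qquad c = tr(U^{-1}S) = \sum_{k,l} (U^{-1})_{kl} $$

so the formula as written predicts a gradient that is a scalar multiple of $W$ itself.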
However, this result does not seem to agree with a simple numerical (finite-difference) derivative. Why is this?
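For concreteness, the kind of numerical check meant here can be sketched as follows. This is a minimal illustration, not necessarily the exact script used: the function names (`A`, `numerical_grad`, `formula_from_post`), the $4 \times 2$ shape of $W$, the value $\sigma = 0.5$, and the random seed are all assumptions chosen for the example, and $J^{ij}$ is assumed to be the single-entry matrix with a 1 at position $(i,j)$.

```python
import numpy as np

def A(W, sigma):
    """A = ln det(W W^T + sigma^2 I), computed via slogdet for stability."""
    n = W.shape[0]
    _, logdet = np.linalg.slogdet(W @ W.T + sigma**2 * np.eye(n))
    return logdet

def numerical_grad(W, sigma, eps=1e-6):
    """Central finite differences of A with respect to each entry W_ij."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (A(W + E, sigma) - A(W - E, sigma)) / (2 * eps)
    return G

def formula_from_post(W, sigma):
    """tr( tr(U^{-1} S) * (W J^{ji} + J^{ij} W^T) ), evaluated entry by entry."""
    n, k = W.shape
    U = W @ W.T + sigma**2 * np.eye(n)
    S = np.ones((n, n))                     # the "matrix full of ones"
    c = np.trace(np.linalg.inv(U) @ S)      # scalar tr(U^{-1} S)
    G = np.zeros_like(W)
    for i in range(n):
        for j in range(k):
            J = np.zeros((n, k))            # assumed single-entry matrix J^{ij}
            J[i, j] = 1.0
            G[i, j] = np.trace(c * (W @ J.T + J @ W.T))
    return G

rng = np.random.default_rng(0)              # arbitrary seed for reproducibility
W = rng.standard_normal((4, 2))
sigma = 0.5

G_num = numerical_grad(W, sigma)
G_post = formula_from_post(W, sigma)
print("max abs difference:", np.max(np.abs(G_num - G_post)))
```

The printed difference is far larger than finite-difference error would explain, which is the disagreement described above.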