
I am reading this amazing tutorial, and so far everything has been clear. Unfortunately, there is one section that doesn't make sense to me: [image] Why is the derivative a vector rather than a diagonal matrix? According to this page, the derivative of tanh is a diagonal matrix, and tanh and max look very similar to me. The tutorial also makes it clear that elementwise binary operators have diagonal Jacobians.

And that makes sense: when I differentiate $\max(0, x_i)$ with respect to $x_j$ for $j \neq i$, the result should be $0$, right?

What am I missing?

Kristof
  • 21
  • It looks like they are doing a component-wise derivative, i.e. the partial of this vector is defined as the vector of partials, if that makes sense. This is what they mean by broadcasting across the elements. – krc Feb 22 '20 at 17:30

2 Answers


You're not missing anything, you're just noticing how sloppy the math is in your current field of study.

When operating on a vector argument, functions are applied element-wise. The differential of such a function is given by $$f = f(x) \quad\implies\quad df = f'(x)\odot dx$$ where $\odot$ denotes the elementwise/Hadamard product and $f'(x)$ is the ordinary scalar derivative, which is also applied element-wise.

The Hadamard product between two vectors can always be eliminated by converting one of the vectors into a diagonal matrix, e.g. $$a\odot b = Ab \quad\Longleftarrow\quad A = {\rm Diag}(a)$$ Eliminating the Hadamard product from the differential yields the gradient (strictly speaking, the Jacobian) as $$\frac{\partial f}{\partial x} = F' = {\rm Diag}\big(f'(x)\big)$$ These ideas apply not just to $\tanh(x)$ but to any function applied element-wise, including $\max(0,x)$, also known as $\operatorname{ReLU}(x)$, which is differentiable everywhere except at $x_i = 0$.
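As a quick sanity check of the diagonal-Jacobian claim above, here is a minimal sketch in plain Python. The helper names (`relu`, `relu_prime`, `numerical_jacobian`, `diag`) and the test point are illustrative choices, not anything taken from the tutorial. The value of the derivative at exactly $0$ is set to $0$ by convention, and the test point is kept away from that kink so the finite differences are meaningful.

```python
def relu(x):
    """Elementwise max(0, x) applied to a list of floats."""
    return [max(0.0, xi) for xi in x]

def relu_prime(x):
    """Elementwise scalar derivative of ReLU (value at 0 chosen as 0 by convention)."""
    return [1.0 if xi > 0 else 0.0 for xi in x]

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference estimate of J[i][j] = d f_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

def diag(v):
    """Build the diagonal matrix Diag(v) from a vector v."""
    n = len(v)
    return [[v[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

x = [-1.5, 0.3, 2.0]           # arbitrary test point, away from the kink at 0
J = numerical_jacobian(relu, x)
D = diag(relu_prime(x))        # Diag(f'(x))

# The two forms of the differential agree: f'(x) ⊙ dx == Diag(f'(x)) dx
dx = [0.1, -0.2, 0.3]
hadamard_form = [fp * d for fp, d in zip(relu_prime(x), dx)]
matrix_form = [sum(D[i][j] * dx[j] for j in range(len(dx))) for i in range(len(dx))]

for row in J:
    print(row)
print(hadamard_form, matrix_form)
```

The off-diagonal entries of `J` come out as zero because perturbing $x_j$ leaves $f(x_i)$ untouched for $i \neq j$, which is exactly the observation made in the question. Writing the derivative as the vector $f'(x)$ is then just a compact way of storing the diagonal of this matrix.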


I notice that some of the comments mention broadcasting to explain/excuse the sloppy mathematics that afflicts the field of neural nets/machine learning. But broadcasting is something different.

Broadcasting simply pads the dimensions of a scalar/vector/matrix/tensor via repeated dyadic multiplication with all-ones vectors. For example $$\eqalign{ &A\in {\mathbb R}^{m\times n} \qquad &v\in {\mathbb R}^{m\times 1} \qquad {\tt1}\in {\mathbb R}^{n\times 1} \\ &A\odot v \qquad&\big({\rm incompatible}\big) \\ &A \odot (v{\tt1}^T) \qquad&\big({\rm compatible\,via\,broadcast}\big) \\ }$$ Broadcasting works for simple multiplication and division, but is worthless (and confusing) when calculating gradients and Jacobians.
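The padding step in the example above can be spelled out explicitly. The following is a small sketch in plain Python, with hypothetical helper names (`outer_with_ones`, `hadamard`) and made-up numbers, that forms $v{\tt1}^T$ by hand before taking the elementwise product. Array libraries such as NumPy perform the equivalent expansion implicitly when an $m\times n$ array is multiplied by an $m\times 1$ array.

```python
def outer_with_ones(v, n):
    """Return v 1^T: the m-vector v repeated across n columns."""
    return [[vi] * n for vi in v]

def hadamard(A, B):
    """Elementwise product of two matrices; shapes must match exactly."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("shapes are incompatible for an elementwise product")
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # m = 2, n = 3
v = [10.0, 100.0]                # m = 2, read as a column vector

# Without padding, A (2x3) and the column v (2x1) do not match.
try:
    hadamard(A, [[vi] for vi in v])
except ValueError as err:
    print("incompatible:", err)

# With padding, v 1^T has the same shape as A.
V = outer_with_ones(v, n=3)
result = hadamard(A, V)
print(result)                    # [[10.0, 20.0, 30.0], [400.0, 500.0, 600.0]]
```

Note that nothing here involves a derivative: broadcasting only changes shapes so that an elementwise operation is defined.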

greg
  • 35,825

@KyleC has the right idea here. Given the functions you're looking at, I suspect you're studying neural nets. These are often implemented with libraries such as NumPy, which provide a multidimensional generalization of SIMD vectorization (see also here). From that perspective, a matrix is a vector of vectors (typically a column vector of row vectors in my experience, but that's not the only option).

J.G.
  • 115,835
  • Thanks for the answers, both of you. You may well be right; the only reason I am struggling with this is that they didn't do that anywhere else in the paper. Whenever they talked about a derivative, it was never the component-wise one... Also, how am I meant to find out which one it is? They seem to use exactly the same notation... – Kristof Feb 23 '20 at 08:59
  • @Kristof Even if you don't pick up on them giving it away with the term broadcasting (which I hope they defined for you somewhere), you already know the reason why it's a matrix. When a scalar-to-scalar function $f$ is vectorized, viz. $f(x)_i=f(x_i)$, $\partial_if_j=\partial_if(x_j)=f^\prime(x_j)\delta_{ij}$ is not only a matrix, but a diagonal one (which, admittedly, might be identified at times with its leading diagonal, construed as a vector). As you noted, it doesn't matter whether $f(x)=\tanh x$ or $f(x)=\max(0,\,x)$. – J.G. Feb 23 '20 at 09:18