Matrix calculus with row vectors

Question

The rules for matrix calculus I find assume column vectors. Are the rules different for row vectors? (I am having a hard time finding them)

I used them when deriving a formula for backpropagation. \begin{gather*} Y\ =\ XW\ +\ B\\ X=\begin{bmatrix} x_{0} & x_{1} & x_{2} \end{bmatrix} ,\ Y=\begin{bmatrix} y_{0} & y_{1} \end{bmatrix} ,\ W=\begin{bmatrix} w_{00} & w_{01}\\ w_{10} & w_{11}\\ w_{20} & w_{21} \end{bmatrix} ,\ B=\begin{bmatrix} b_{0} & b_{1} \end{bmatrix} \end{gather*}

\begin{gather*}
\left(\frac{\partial L}{\partial W}\right)^{T} =\begin{bmatrix} \frac{\partial L}{\partial w_{00}} & \frac{\partial L}{\partial w_{01}}\\ \frac{\partial L}{\partial w_{10}} & \frac{\partial L}{\partial w_{11}}\\ \frac{\partial L}{\partial w_{20}} & \frac{\partial L}{\partial w_{21}} \end{bmatrix} =\begin{bmatrix} \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{00}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{01}}\\ \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{10}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{11}}\\ \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{20}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{21}} \end{bmatrix}\\ \\
\text{Focus on one term:}\\
y_{0} = w_{00} x_{0} +w_{10} x_{1} +w_{20} x_{2} +b_{0}\\
y_{1} = w_{01} x_{0} +w_{11} x_{1} +w_{21} x_{2} +b_{1}\\ \\
\frac{\partial Y}{\partial w_{00}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{00}}\\ \frac{\partial y_{1}}{\partial w_{00}} \end{bmatrix} =\begin{bmatrix} x_{0}\\ 0 \end{bmatrix}\\
\color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{00}} = \begin{bmatrix} \color{red}{\frac{\partial L}{\partial y_{0}}} & \color{red}{\frac{\partial L}{\partial y_{1}}} \end{bmatrix}\begin{bmatrix} x_{0}\\ 0 \end{bmatrix} =\color{red}{\frac{\partial L}{\partial y_{0}}} \, x_{0} + \color{red}{\frac{\partial L}{\partial y_{1}}} \cdot 0 =\color{red}{\frac{\partial L}{\partial y_{0}}} \, x_{0}\\ \\
\frac{\partial Y}{\partial w_{10}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{10}}\\ \frac{\partial y_{1}}{\partial w_{10}} \end{bmatrix} =\begin{bmatrix} x_{1}\\ 0 \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{01}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{01}}\\ \frac{\partial y_{1}}{\partial w_{01}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{0} \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{11}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{11}}\\ \frac{\partial y_{1}}{\partial w_{11}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{1} \end{bmatrix} ,\\
\frac{\partial Y}{\partial w_{20}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{20}}\\ \frac{\partial y_{1}}{\partial w_{20}} \end{bmatrix} =\begin{bmatrix} x_{2}\\ 0 \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{21}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{21}}\\ \frac{\partial y_{1}}{\partial w_{21}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{2} \end{bmatrix}\\ \\
\text{Finally:}\\
\left(\frac{\partial L}{\partial W}\right)^{T} =\begin{bmatrix} \frac{\partial L}{\partial y_{0}} \, x_{0} & \frac{\partial L}{\partial y_{1}} \, x_{0}\\ \frac{\partial L}{\partial y_{0}} \, x_{1} & \frac{\partial L}{\partial y_{1}} \, x_{1}\\ \frac{\partial L}{\partial y_{0}} \, x_{2} & \frac{\partial L}{\partial y_{1}} \, x_{2} \end{bmatrix} =\begin{bmatrix} x_{0}\\ x_{1}\\ x_{2} \end{bmatrix}\begin{bmatrix} \frac{\partial L}{\partial y_{0}} & \frac{\partial L}{\partial y_{1}} \end{bmatrix} = X^{T}\color{red}{\frac{\partial L}{\partial Y}}
\end{gather*}

However, I am not sure if the final result has the correct shape.

\begin{gather*}
\frac{\partial L}{\partial X} = \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial X}\\
\frac{\partial Y}{\partial X} =\begin{bmatrix} \frac{\partial y_{0}}{\partial x_{0}} & \frac{\partial y_{0}}{\partial x_{1}} & \frac{\partial y_{0}}{\partial x_{2}}\\ \frac{\partial y_{1}}{\partial x_{0}} & \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \end{bmatrix} =\begin{bmatrix} w_{00} & w_{10} & w_{20}\\ w_{01} & w_{11} & w_{21} \end{bmatrix} =W^{T}\\
\frac{\partial L}{\partial X} = \color{red}{\frac{\partial L}{\partial Y}} W^{T} ,\quad \color{red}{\frac{\partial L}{\partial Y}} = \color{red}{\begin{bmatrix} \frac{\partial L}{\partial y_{0}} & \frac{\partial L}{\partial y_{1}} \end{bmatrix}}\\ \\
\text{If the rules were simply reversed:}\\
\color{red}{\frac{\partial L}{\partial Y}} = \color{red}{\begin{bmatrix} \frac{\partial L}{\partial y_{0}}\\ \frac{\partial L}{\partial y_{1}} \end{bmatrix}} ,\quad \frac{\partial Y}{\partial X} =\begin{bmatrix} \frac{\partial y_{0}}{\partial x_{0}} & \frac{\partial y_{1}}{\partial x_{0}}\\ \frac{\partial y_{0}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{1}}\\ \frac{\partial y_{0}}{\partial x_{2}} & \frac{\partial y_{1}}{\partial x_{2}} \end{bmatrix}\\
\text{Then the dimensions for } \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial X} \text{ won't match}
\end{gather*}

Almost anything defined or described using column vectors can be restated in terms of row vectors by taking the transpose of both sides of the equation. For instance, I prefer using column vectors for Markov chains, with $Av_t = v_{t+1}$ where $v_i$ is a column vector, but you could just as easily phrase this as $w_{t+1}=w_tB$ with row vectors, where $w_i=v_i^T$ and $B=A^T$. – JMoravitz, May 30 '23 at 16:40
Yes, I agree with @JMoravitz. I believe you can transpose everything in the equations and they will be equivalent. Just make sure you're multiplying the matrix and the vector in the correct order (left vs right). Check a given row to make sure. – Tom, May 30 '23 at 17:12
@Tom What confuses me is that the rule for dL/dW (a scalar and a 3 by 2 matrix) will be the same whether I am using row or column vectors (right?), and I arrived at the shape 2 by 3 (which is correct since W is in the denominator?) even though I used rules for column vectors in the derivation. – BPDev, May 30 '23 at 17:24

greg · Accepted Answer · 2023-05-31T10:29:13.337


$ \def\o{{\tt1}} \def\BR#1{\Big(#1\Big)} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} \def\c#1{\color{red}{#1}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $Assume that the gradient $\LR{G=\grad LY}$ is known and use it to find the other gradients.

Calculate the differential of the loss function $$\eqalign{ dL &= G:dY \\ &= G:\BR{dX\:W + X\:dW} \\ &= \LR{GW^T}:dX + \LR{X^TG}:dW \\ }$$ Then hold $W$ constant to obtain the gradient with respect to $X,\,$ and vice versa $$\eqalign{ \grad LX &= GW^T \;\qquad\; \grad LW &= X^TG \qquad \\ \\ }$$
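
The two gradient formulas can be checked numerically. The sketch below is only an illustration: it assumes the loss $L=\tfrac12\sum_j y_j^2$ (so that $G=Y$), uses made-up values for $X$, $W$ and $B$ with the shapes from the question, and the helper names (`matmul`, `numeric_grad`, and so on) are hypothetical. It compares $GW^T$ and $X^TG$ with central finite differences.

```python
# Finite-difference check of dL/dX = G W^T and dL/dW = X^T G for Y = XW + B.
# Assumption for illustration only: L = 0.5 * sum(Y**2), so G = dL/dY = Y.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(X, W, B):
    """Y = XW + B with X of shape 1x3, W of shape 3x2, B of shape 1x2."""
    XW = matmul(X, W)
    return [[xw + b for xw, b in zip(r1, r2)] for r1, r2 in zip(XW, B)]

def loss(X, W, B):
    Y = forward(X, W, B)
    return 0.5 * sum(y * y for row in Y for y in row)

def numeric_grad(f, M, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to the
    entries of M (perturbed in place and restored); same shape as M."""
    grad = [[0.0] * len(row) for row in M]
    for i, row in enumerate(M):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = f()
            row[j] = old - eps
            down = f()
            row[j] = old
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

# Made-up example values (not from the question).
X = [[1.0, -2.0, 0.5]]
W = [[0.3, -1.0], [2.0, 0.7], [-0.4, 1.5]]
B = [[0.1, -0.2]]

G = forward(X, W, B)              # dL/dY equals Y for this particular loss
grad_X = matmul(G, transpose(W))  # 1x3, same shape as X
grad_W = matmul(transpose(X), G)  # 3x2, same shape as W

num_X = numeric_grad(lambda: loss(X, W, B), X)
num_W = numeric_grad(lambda: loss(X, W, B), W)

print(max_abs_diff(grad_X, num_X), max_abs_diff(grad_W, num_W))
```

Both printed differences should be at the level of floating-point noise, and each analytic gradient has the same shape as the variable it belongs to.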

In the above, a colon is used to denote the Frobenius product.
It has the following properties $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \frob{A}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ A:B &= B:A \;=\; B^T:A^T \\ C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\ }$$
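
As a worked step, here is how the last property moves $W$ and $X$ off the differentials in the derivation. The first line takes $C=G,\ A=dX,\ B=W$ and the second takes $C=G,\ A=X,\ B=dW$ (note that $B$ here is the placeholder from the property, not the bias). The shapes are the ones from the question and are shown only as a sanity check.
$$\eqalign{
G:\left(dX\,W\right) &= \left(GW^T\right):dX \qquad & (1\times 2)(2\times 3) = 1\times 3, \text{ the shape of } X \\
G:\left(X\,dW\right) &= \left(X^TG\right):dW \qquad & (3\times 1)(1\times 2) = 3\times 2, \text{ the shape of } W \\
}$$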

Note that the above derivation is quite general. It does not matter if $\,\{B,X,Y\}\,$ are row vectors or column vectors. They could even be matrices.
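
To illustrate with column vectors, here is a sketch of the transposed problem. The lowercase symbols $x=X^T,\ y=Y^T,\ b=B^T,\ g=G^T$ are introduced only for this example and do not appear in the question.
$$\eqalign{
y &= W^Tx + b \\
dL &= g:dy \;=\; g:\left(dW^T\,x + W^T\,dx\right) \\
&= \left(gx^T\right):dW^T + \left(Wg\right):dx \\
&= \left(xg^T\right):dW + \left(Wg\right):dx \\
\frac{\partial L}{\partial x} &= Wg = \left(GW^T\right)^T \qquad \frac{\partial L}{\partial W} = xg^T = X^TG \\
}$$
The gradient with respect to $W$ is unchanged, and the gradient with respect to $x$ is the transpose of the row-vector result, as expected.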

edited May 31 '23 at 10:29

answered May 31 '23 at 10:20


Based on this rule, shouldn't dL/dW (W: 3 x 2) have shape 2 x 3? (X_T * G: (3 x 1) x (1 x 2)) – BPDev May 31 '23 at 14:58
@BPDev I use the opposite Layout Convention than the one used in that rule. These issues are addressed in the section immediately following your link. – greg May 31 '23 at 16:10
Ah thank you. To confirm, the layout conventions do not depend on how we choose to represent vectors correct? (For numerator layout, dY/dL is a column vector, but Y doesn't have to be a column vector despite the phrasing "Numerator layout, i.e. lay out according to y and xT.") – BPDev May 31 '23 at 18:06

@BPDev Consider the initial known gradient $\left(G=\frac{\partial L}{\partial Y}\right),$ where $Y$ is a row vector. The Layout Convention determines whether $G$ is a row vector or a column vector. For me, the best convention is the one that makes $G$ the same shape as $Y$ because that's the one which allows me to write $$dL = G:dY$$ without introducing unnecessary transpose operations. – greg May 31 '23 at 19:22
