I am currently trying to implement back propagation as described in the Wikipedia article.

It defines the gradient of the weights in layer $l$ as: $$\delta^l (a^{l-1})^T$$

where $a^{l}$ is the output of layer $l$.

The article says:

Note that $\delta^l$ is a vector, of length equal to the number of nodes in level $l$; [...]

The number of entries of vector $a^{l}$ is equal to the number of nodes in layer $l$. But how can one calculate $\delta^l (a^{l-1})^T$ if layer $l-1$ and layer $l$ have a different number of nodes?

Luca9984
  • It's an outer product, which is defined for two vectors of any sizes. Notice also that the dimensions of the matrix that results from that outer product are the same as the dimensions of the weight matrix that connects the two layers. – Joe Jun 08 '21 at 10:47
  • Note that $\delta^l$ has size $N\times1$ (where $N$ is the number of nodes at layer $l$) and that $(a^{l-1})^T$ has size $1\times M$ (where $M$ is the number of nodes at layer $l-1$) so the multiplication is possible, and it gives a matrix of size $N\times M$. This answer may help you to gain more insight: https://stats.stackexchange.com/questions/509860/attempting-to-implement-a-vectorized-version-of-one-of-the-backpropagation-equat/509911#509911 – Javier TG Jun 08 '21 at 11:00
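
The shape argument from the comments can be checked numerically. The following is only a minimal NumPy sketch: the layer sizes `M = 4` and `N = 3`, the random values, and the variable names are assumptions chosen for illustration, and the weight matrix is assumed to have shape $N\times M$ (so that $W^l a^{l-1}$ is defined).

```python
import numpy as np

# Hypothetical layer sizes, chosen only for illustration.
M = 4  # number of nodes in layer l-1
N = 3  # number of nodes in layer l

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(M, 1))  # a^{l-1} as a column vector, shape (M, 1)
delta = rng.normal(size=(N, 1))   # delta^l as a column vector, shape (N, 1)
W = rng.normal(size=(N, M))       # assumed weight matrix from layer l-1 to layer l

# (N x 1) times (1 x M) gives an (N x M) matrix: the outer product.
grad_W = delta @ a_prev.T

# The gradient has the same shape as the weight matrix it belongs to.
assert grad_W.shape == W.shape

# The same result via NumPy's dedicated outer-product function.
assert np.allclose(grad_W, np.outer(delta, a_prev))
```

Each entry `grad_W[i, j]` is `delta[i] * a_prev[j]`, so the two vectors never need to have the same length.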
