I'm having a hard time trying to derive the maths behind LSTMs and vanishing gradients.

I had a lot of help from LSTM forward and backward pass, but I got stuck on page 11 of that document.

Given the image:

enter image description here

We can form a system of equations, $$ \begin{bmatrix} a^t \\ i^t \\ f^t \\ o^t \\ \end{bmatrix} = \begin{bmatrix} \tanh(W_cx^t+U_ch^{t-1}) \\ \sigma(W_ix^t+U_ih^{t-1}) \\ \sigma(W_fx^t+U_fh^{t-1}) \\ \sigma(W_ox^t+U_oh^{t-1}) \\ \end{bmatrix} = \begin{bmatrix} \tanh(\hat a^t) \\ \sigma (\hat i^t) \\ \sigma (\hat f^t) \\ \sigma (\hat o^t) \\ \end{bmatrix} $$ We can then collect the pre-activations into $z$:

$$ z= \begin{bmatrix} \hat a^t \\ \hat i^t \\ \hat f^t \\ \hat o^t \\ \end{bmatrix} = \begin{bmatrix} W_c & U_c \\ W_i & U_i \\ W_f & U_f \\ W_o & U_o \\ \end{bmatrix} \begin{bmatrix} x^t \\ h^{t-1} \\ \end{bmatrix} $$
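As a rough illustration of the stacked form above, here is a minimal pure-Python sketch of one forward step. The function names (`lstm_gates`, `matvec`), the row ordering of `W` (blocks for $a$, $i$, $f$, $o$) and the absence of bias terms are assumptions made for the example; they follow the equations in the question rather than any particular library.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lstm_gates(W, x, h_prev):
    """Compute z = W [x; h_prev] and the four gate activations.

    W is assumed to have 4*n rows, stacked as the a, i, f, o blocks,
    each row of length len(x) + len(h_prev).
    """
    I = x + h_prev                  # stacked input [x^t; h^{t-1}]
    z = matvec(W, I)                # pre-activations [a_hat; i_hat; f_hat; o_hat]
    n = len(z) // 4
    a = [math.tanh(v) for v in z[:n]]
    i = [sigmoid(v) for v in z[n:2 * n]]
    f = [sigmoid(v) for v in z[2 * n:3 * n]]
    o = [sigmoid(v) for v in z[3 * n:]]
    return z, a, i, f, o
```

Calling `lstm_gates(W, x, h_prev)` with lists of floats returns the pre-activation vector together with the four gate vectors.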

The backprop derivation for $z$ is given on page 10 of LSTM forward and backward pass:

$$ \delta z= \begin{bmatrix} \delta \hat a^t \\ \delta \hat i^t \\ \delta \hat f^t \\ \delta \hat o^t \\ \end{bmatrix} = \begin{bmatrix} \delta a^t \odot (1-\tanh^2(\hat a^t)) \\ \delta i^t \odot i^t \odot (1-i^t) \\ \delta f^t \odot f^t \odot (1-f^t) \\ \delta o^t \odot o^t \odot (1-o^t) \\ \end{bmatrix} $$
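The element-wise products in $\delta z$ can be sketched in a few lines of pure Python. The function name `delta_z` and its argument layout are hypothetical; the only identity used beyond the formula above is $1-\tanh^2(\hat a^t) = 1-(a^t)^2$.

```python
def delta_z(a, i, f, o, da, di, df, do):
    """Push gradients w.r.t. the gate outputs back through tanh / sigmoid.

    a, i, f, o     : gate activations from the forward pass (lists of floats)
    da, di, df, do : gradients of the error w.r.t. those activations
    Returns the stacked vector [d a_hat; d i_hat; d f_hat; d o_hat].
    """
    d_a_hat = [g * (1.0 - v * v) for g, v in zip(da, a)]   # tanh'(a_hat) = 1 - a^2
    d_i_hat = [g * v * (1.0 - v) for g, v in zip(di, i)]   # sigma'(i_hat) = i (1 - i)
    d_f_hat = [g * v * (1.0 - v) for g, v in zip(df, f)]
    d_o_hat = [g * v * (1.0 - v) for g, v in zip(do, o)]
    return d_a_hat + d_i_hat + d_f_hat + d_o_hat
```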

However, the next part, on page 11 of LSTM forward and backward pass, is where I'm confused.

Given $\delta z$, we need to find $\delta W$ and $\delta h^{t-1}$.

1) The author wrote down $\delta I^t = W^T * \delta z$:

If we try to rearrange this with some linear algebra:

$$z = W^T * I^t$$

Multiply both sides with $I^{t^{-1}}$

$$I^{t^{-1}} z = W^T$$

Multiply both sides with $z^{-1}$

$$I^{t^{-1}} = z^{-1} W^T$$

This doesn't match the author's formula. Why?
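One way to see where the author's formula could come from is the chain rule rather than matrix inversion (neither $I^t$ nor $z$ is a square matrix, so $I^{t^{-1}}$ and $z^{-1}$ are not defined in general). A sketch, assuming $z = W I^t$ with $W$ the stacked weight matrix and $I^t$ the stacked input of $x^t$ and $h^{t-1}$, and writing $\delta$ for the gradient of the error $E$:

$$ z_j = \sum_k W_{jk} I^t_k \quad\Rightarrow\quad \delta I^t_k = \frac{\partial E}{\partial I^t_k} = \sum_j \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial I^t_k} = \sum_j W_{jk}\,\delta z_j = (W^T \delta z)_k $$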

2) Let's ignore 1), and try to solve for $\delta I$

$$ \delta I = \begin{bmatrix} \delta x^t \\ \delta h^{t-1} \\ \end{bmatrix} $$ $$ \delta I = \frac{dE}{dI} = \begin{bmatrix} \frac{dE}{dx^t} \\ \frac{dE}{dh^{t-1}} \\ \end{bmatrix} $$

But $\frac{dE}{dh^{t-1}}$ depends on several of the equations in $z$.

Do I solve for them individually and add them up?

Note: $\frac{dE}{d\hat i_t}$ can be found on page 10 of LSTM forward and backward pass.

$$h_{t-1}^{i_t}=\frac{dE}{dh_{t-1}^{i_t}}=\frac{dE}{d\hat i_t}\frac{d\hat i_t}{dh_{t-1}}=\frac{dE}{d\hat i_t}\frac{d}{dh_{t-1}}i_t(1-i_t)$$

Replace $i_t$ with $\sigma(W_ix^t+U_ih^{t-1})$, and replace $\frac{dE}{d\hat i_t}$ with $\delta \hat i_t \delta i_t$:

$$=\delta \hat i_t \delta i_t \frac{d}{dh_{t-1}} \sigma(W_ix^t+U_ih^{t-1}) (1-\sigma(W_ix^t+U_ih^{t-1}))$$

This looks solvable, but I would end up with 4 equations like the one above. Do I add them all together at the end to get $\delta h^{t-1}$? For example:

$$ \delta I = \frac{dE}{dI} = \begin{bmatrix} \frac{dE}{dx^t} \\ \frac{dE}{dh^{t-1}} \\ \end{bmatrix} = \begin{bmatrix} \text{(ignore)} \\ h_{t-1}^{i_t}+h_{t-1}^{a_t}+h_{t-1}^{f_t}+h_{t-1}^{o_t} \\ \end{bmatrix} $$

My reasoning is that we want the total error contributed by $h_{t-1}$, so the four contributions should be added together. Is that right?
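A small numerical sketch can compare the two routes: applying $\delta I = W^T \delta z$ in one step, versus computing one contribution per gate and adding the four. It assumes $z = W I^t$ with the rows of $W$ stacked as the $a$, $i$, $f$, $o$ blocks; the sizes, weights and $\delta z$ values are made up for illustration, and only the gradient flowing through $z$ at time $t$ is considered.

```python
def transpose_matvec(M, v):
    """Return M^T v, with M given as a list of rows."""
    return [sum(M[r][c] * v[r] for r in range(len(M))) for c in range(len(M[0]))]


# Hypothetical sizes: 1 input, 2 hidden units, so I^t has 3 entries and z has 4 * 2 = 8.
n_x, n_h = 1, 2
W = [
    [0.1, 0.2, -0.3],   # a-block: columns are [W_c | U_c]
    [0.4, -0.5, 0.6],
    [0.7, 0.8, -0.9],   # i-block
    [-0.1, 0.3, 0.5],
    [0.2, -0.4, 0.6],   # f-block
    [0.9, 0.1, -0.2],
    [-0.6, 0.5, 0.4],   # o-block
    [0.3, -0.7, 0.8],
]
delta_z = [0.5, -1.0, 0.25, 0.75, -0.5, 1.5, 2.0, -0.25]

# Route 1: delta_I = W^T delta_z, then split into [delta_x; delta_h_prev].
delta_I = transpose_matvec(W, delta_z)
delta_x, delta_h_prev = delta_I[:n_x], delta_I[n_x:]

# Route 2: each gate g contributes U_g^T (delta g_hat); the four are added.
per_gate = []
for g in range(4):
    rows = W[g * n_h:(g + 1) * n_h]
    U_g = [row[n_x:] for row in rows]          # recurrent part of this gate's block
    per_gate.append(transpose_matvec(U_g, delta_z[g * n_h:(g + 1) * n_h]))
summed = [sum(parts) for parts in zip(*per_gate)]

print(delta_h_prev, summed)
```

Under these assumptions the two vectors agree up to floating-point rounding, which is consistent with summing the four per-gate terms.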

3) Finding $\delta W$ looks like an even bigger task, and I'm not sure where to start.
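For $\delta W$, one possible starting point, again assuming $z = W I^t$, is that $\partial z_j / \partial W_{jk} = I^t_k$, so each time step contributes the outer product $\delta z^t (I^t)^T$, and the contributions are summed over time steps because the same $W$ is used at every step. The function name `delta_W` below is hypothetical.

```python
def delta_W(delta_zs, inputs):
    """Accumulate sum_t delta_z^t (I^t)^T over all time steps.

    delta_zs : list over time of delta_z^t vectors (each of length 4*n)
    inputs   : list over time of stacked inputs I^t = [x^t; h^{t-1}]
    """
    rows, cols = len(delta_zs[0]), len(inputs[0])
    total = [[0.0] * cols for _ in range(rows)]
    for dz, I in zip(delta_zs, inputs):
        for r in range(rows):
            for c in range(cols):
                total[r][c] += dz[r] * I[c]    # entry (r, c) of the outer product
    return total
```

The result has the same shape as $W$, so it can be used directly in a gradient step.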

4) How does this relate to the error carousel? After deriving the gradients for all the weights and for $h_{t-1}$, I don't see how this leads to avoiding vanishing gradients. I read somewhere that the weights are a constant 1, or something along those lines?
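For context on the error carousel, here is a sketch that assumes the usual cell-state update, which is not written out in the question:

$$ c^t = i^t \odot a^t + f^t \odot c^{t-1} \quad\Rightarrow\quad \frac{\partial c^t}{\partial c^{t-1}} = \operatorname{diag}(f^t) \quad \text{(direct path only)} $$

Along this direct path, the gradient from $c^t$ back to $c^{t-k}$ is scaled by $f^t \odot f^{t-1} \odot \dots \odot f^{t-k+1}$ rather than by repeated products of a weight matrix and activation derivatives. In the original LSTM formulation without a forget gate, the cell's self-connection has a fixed weight of 1, which is presumably the "constant 1" remark: the error then passes through the cell state unscaled.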

I know this is kinda long, but feel free to ask for clarification if my question does not make sense. I've been trying to solve this for almost half a month now.

Appreciate any sort of guidance. Thanks.

  • It's hard to follow the notation. But as far as I understand, when presenting a sequence of length $N$ to a recurrent neural network, we unroll it to obtain a big non-recurrent network with $N$ layers, whose weights are constrained to be "the same" in each layer. Then we can compute the backpropagation as usual in this unrolled network, obtaining a gradient for the weights. – reuns Jul 11 '17 at 07:38
  • @user1952009 I don't quite understand your comment about "the same" in each layer. When you mean layer, do you mean the unrolled network? Because here I'm just assuming to be 1 layer (meaning that we don't stack 2 LSTMs on top of each other). Doesn't the weight change for every unroll? Since the weights change during each time step, so if we use the same weights for each time step, I don't think that is correct? – user1157751 Jul 11 '17 at 07:49
  • The weights are fixed, only the activations depend on the input. Try with only one neuron $n$, one input $i$, one self-recurrent connexion and one output $o$ (ie. $3$ weights $i \to n,n\to n,n \to o$). Present a sequence of $T$ values at the input and simulate the network for $T$ timesteps. Unroll it to obtain a non-recurrent network with $T$ layers (and only one neuron $n_t$ in each layer) and $3T$ weights : $i_t \to n_t, n_t \to n_{t+1}, n_t \to o_t$. The constraint is that those weights do not depend on $t$. – reuns Jul 11 '17 at 08:03
  • For this, simply compute the backpropagation as usual (as if the weights would depend on $t$) then sum over $t$ the obtained gradients. Finally, optimize everything to obtain the forward and backward pass. – reuns Jul 11 '17 at 08:07
  • @user1952009 Sorry I do not understand your equations. If you can write it out with some examples that would be great as an answer. – user1157751 Jul 11 '17 at 09:00
  • What do you not understand? I told you to take the simplest recurrent neural network, to simulate it and unroll it for $T$ timesteps, and see how you could apply backpropagation to it. This is exactly what is explained in Arun's website, except he replaced my single recurrent neuron by an LSTM. – reuns Jul 11 '17 at 09:14
  • @user1952009 I'm going to try it out and comment back, thanks. – user1157751 Jul 11 '17 at 20:13
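The single-neuron experiment suggested in the comments can be sketched in pure Python as follows. The $\tanh$ activation, the squared-error loss, the zero initial state and all function names are assumptions added for the example; the comments only specify the three weights ($i \to n$, $n \to n$, $n \to o$) and the recipe "backpropagate through the unrolled network, then sum the gradients over $t$".

```python
import math


def forward(w_in, w_rec, w_out, xs):
    """Simulate one tanh neuron with a self-recurrent connection for len(xs) steps."""
    n, ns, outs = 0.0, [], []
    for x in xs:
        n = math.tanh(w_in * x + w_rec * n)   # n_t depends on x_t and n_{t-1}
        ns.append(n)
        outs.append(w_out * n)                # linear read-out o_t
    return ns, outs


def loss(w_in, w_rec, w_out, xs, targets):
    _, outs = forward(w_in, w_rec, w_out, xs)
    return 0.5 * sum((o - y) ** 2 for o, y in zip(outs, targets))


def bptt(w_in, w_rec, w_out, xs, targets):
    """Backpropagate through the unrolled network, summing weight gradients over t."""
    ns, outs = forward(w_in, w_rec, w_out, xs)
    g_in = g_rec = g_out = 0.0
    dn_next = 0.0                             # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        do = outs[t] - targets[t]             # dE/do_t
        g_out += do * ns[t]
        dn = do * w_out + dn_next             # total gradient reaching n_t
        dpre = dn * (1.0 - ns[t] ** 2)        # back through tanh
        n_prev = ns[t - 1] if t > 0 else 0.0
        g_in += dpre * xs[t]                  # same weight at every t, so sum
        g_rec += dpre * n_prev
        dn_next = dpre * w_rec                # pass gradient on to n_{t-1}
    return g_in, g_rec, g_out
```

A finite-difference comparison against `loss` is a convenient way to check the summed gradients before moving on to the LSTM case.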
