2

I want to define feed-forward ANN as a function in a clean mathematical way and as accurately as possible. I can do something if I fix the number of layers, but I would like to generalize it for any number of layers (possibly without using "$\dots$"). Here is roughly what I have done:

Let $F$ be the ANN, $x \in R^{input}$ be the its input and $F(x) = y \in R^{output}$ its output. Let $f_i: R^{m_i} \rightarrow R^{m_i}$ be a generic activation function applied to the dot product between the weight matrix $W_i$ of layer $i$ and the layer input. $m_i$ is the number of outputs of layer $i$

$$ F: R^{input} \rightarrow R^{output} \\ F(x) = f_3(W_3 \cdot f_2(W_2 \cdot f_1(W_1 \cdot x)))$$

Can you find a better way?

1 Answers1

2

The $(\ldots)$ notation in mathematics usually stands for informal placeholder for recursion/induction. In the ZF axiomatic set theory there's a theorem called "The recursion theorem" that tells you when such recursively defined functions exist. If you want to be mathematically rigorous you can define $$F_0 (x) = f_1(W_1 \cdot x)$$ $$F_{n+1} (x) = f_{n+2}(W_{n+2} \cdot F_n(x))$$ Then if the number of layers is $l$ you can set $F = F_{l-1}$.

gcc-6.0
  • 695
  • Do you mean $f_{n+1}(W_{n+1} \cdot F_n(x))$? – Jair Taylor Jul 06 '18 at 16:58
  • @JairTaylor If we set it up like that it would fail in the first iteration. $$F_1(x) = F_{0+1}(x) = f_{0+1}(W_{0+1} \cdot F_{0}(x)) = f_1( W_1 \cdot f_1( W_1 \cdot x)) $$ – gcc-6.0 Jul 06 '18 at 17:06
  • 1
    @JairTaylor $F_1(x)$ is $f_2(W_2 \cdot f_1(W_1 \cdot x))$ not $f_1(W_1 \cdot x)$ so it works fine. And yes we could set $F_0 = id$ and change $n+2$ to $n+1$ but both are kinda arbitrary choices. We don't have neural networks with 0 layers anyway. – gcc-6.0 Jul 06 '18 at 17:18
  • oh, I see. My mistake. – Jair Taylor Jul 06 '18 at 17:20
  • Typically we would use a $0$-index to be the identity, e.g., $A^0 = id$ for a matrix $A$. But the choice is somewhat arbitrary. – Jair Taylor Jul 06 '18 at 17:23
  • 1
    @JairTaylor Yes you're correct. In that case it's much more convenient because it makes power arithmetic easier. In case of groups it's even more convenient to define the zeroth power of any element to be the identity because you can induce homomorpism from the additive group of integers to the cyclic group of any particular element. I just did not see any value in making statements about the zero layer neural network so I made that arbitrary choice which would make that more inconvenient. – gcc-6.0 Jul 06 '18 at 17:32