20

I understand that the function "squashes" a real vector space between the values 0 and 1.

However I don't see what this has to do with the "max" function, or why that makes it a "softer" version of the max function.

Parcly Taxel
  • 103,344
user56834
  • 12,925

4 Answers

16

The largest element in the input vector remains the largest element after the softmax function is applied to the vector, hence the "max" part. The "soft" signifies that the function keeps information about the other, non-maximal elements in a reversible way: the input can be recovered from the output up to an additive constant. A "hardmax", by contrast, keeps only the maximum and discards everything else.

The function produces a probability distribution from any vector, and is thus used in machine learning when inputs need to be classified. The output of a neural network is often normalised via this function so that it can be read as class probabilities, which probability-based losses such as cross-entropy require.
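To make the two claims above concrete, here is a minimal sketch in plain Python. The helper name `softmax` and the example values in `scores` are assumptions chosen for illustration; subtracting the maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import math

def softmax(z):
    """Map a list of real numbers to a probability distribution."""
    # Subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing on large inputs.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, 0.5, 3.0]  # arbitrary illustrative values
probs = softmax(scores)

# Every output is positive and the outputs sum to 1 (up to rounding).
print(probs, sum(probs))

# The largest input and the largest output sit at the same position.
print(scores.index(max(scores)) == probs.index(max(probs)))
```

The ordering of all entries is preserved, not only the position of the maximum, because the exponential is strictly increasing.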

Parcly Taxel
  • 103,344
  • 1
    If that is the argument, then shouldn't all continuous monotonically increasing functions be called softmax functions? – user56834 Aug 11 '16 at 05:26
  • 1
    The softmax function is defined on a vector and outputs a vector. It is not possible to define a "monotonically increasing" relation on vectors, and there is no continuity within vectors themselves, being made of a discrete number of elements. – Parcly Taxel Aug 11 '16 at 05:36
  • 5
    The "hardmax" function takes a vector and sets its largest element to 1, and all others to 0. The softmax function does almost the same thing, but it is continuous, and most machine learning techniques require this property to train neural networks, hence the "soft" modifier. – Parcly Taxel Aug 11 '16 at 05:42
3

I always thought it was called softmax because it is differentiable ("soft") at all points for all elements of the input vector. This explanation would be analogous to what makes the softplus function, $f(x) = \ln(1 + e^x)$, the "soft" version of $f(x) = \max(0, x)$
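The softplus analogy can be checked numerically. The sketch below is only an illustration: the function names `max_zero` and `softplus` are assumptions, and softplus is written in an algebraically equivalent form, $\max(0, x) + \ln(1 + e^{-|x|})$, to avoid overflow for large $x$.

```python
import math

def max_zero(x):
    """The 'hard' function max(0, x); it has a kink at x = 0."""
    return max(0.0, x)

def softplus(x):
    """ln(1 + e^x), rewritten as max(0, x) + ln(1 + e^{-|x|}) for stability."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))

# Far from 0 the two functions nearly coincide; near 0 softplus rounds off the kink.
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(x, max_zero(x), round(softplus(x), 4))
```

Softplus always lies slightly above $\max(0, x)$, and the gap is largest at $x = 0$, where it equals $\ln 2$.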

2

According to this book about neural networks, consider a slightly modified version of the softmax:

$$\frac{e^{c\cdot z_j}}{\sum_k e^{c\cdot z_k}}$$

With $c=1$ we get back to the softmax.

This is for sure still a probability distribution: everything sums up to $1$, and all quantities are positive. So we can write for the event $j$:

$$P(j) = \frac{e^{c\cdot z_j}}{\sum_k e^{c\cdot z_k}}$$

Now take the limit $c\rightarrow\infty$.

Consider the ratio of two probabilities and take the limit as $c$ goes to infinity:

$$\lim_{c\rightarrow \infty}\frac{e^{c\cdot z_j}}{e^{c\cdot z_k}} = \lim_{c\rightarrow \infty} e^{c(z_j-z_k)}$$

The limit is $1$ if $z_j=z_k$, $0$ if $z_j<z_k$, and $\infty$ if $z_j>z_k$.

Now consider the second case. For each $j$ for which this happens, the denominator of $P(j)$ contains a term $e^{c\cdot z_k}$ that goes to infinity faster (or to $0$ slower) than the numerator $e^{c\cdot z_j}$. Hence $P(j)\rightarrow 0$.

So the only $P(j)$ that stay different from zero are those whose $z_j$ equals the biggest value. Suppose there are $n$ elements for which this holds, i.e., $n$ indices $\ell$ for which $z_\ell=z_m$, where $z_m$ denotes the maximum (I used different indices to avoid confusion).

Then:

$$\lim_{c\rightarrow \infty} P(\ell) = \lim_{c\rightarrow \infty}\frac{e^{c\cdot z_\ell}}{n e^{c\cdot z_\ell} + \sum_{k:\, z_k<z_\ell} e^{c\cdot z_k}} = \frac{1}{n}$$

If you put these limits in a vector you get a vector of zeros, except at the positions where the input is maximal. In classification problems, where softmax is used, typically a single element attains the maximum (its probability is bigger than the others), so the resulting vector has one entry equal to $1$ and all the others equal to $0$. In general the vector is nonzero exactly at the maximal element(s): the limiting function picks out the maximum and gives it the biggest value a probability can take, namely $1$ (or $1/n$ each when $n$ elements tie). This is commonly termed the max function.

In the softmax one takes $c=1$, which gives a softened version of the above function: hence the name softmax.
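The limiting argument can be watched numerically. This is a rough sketch, not taken from the book: the function name `scaled_softmax`, the input vector, and the chosen values of $c$ are all assumptions for illustration. The input deliberately contains a tie so that the $1/n$ case shows up.

```python
import math

def scaled_softmax(z, c=1.0):
    """Compute e^{c z_j} / sum_k e^{c z_k} for every j."""
    scaled = [c * x for x in z]
    # Shift by the maximum so that exp() cannot overflow for large c.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 3.0, 3.0, 2.0]  # two entries tie for the maximum, so n = 2

# As c grows, the mass concentrates on the maximal entries, 1/2 each here.
for c in [1, 5, 50]:
    print(c, [round(p, 4) for p in scaled_softmax(z, c)])
```

With $c=1$ the non-maximal entries still receive visible probability; by $c=50$ they are numerically indistinguishable from $0$ and the two tied entries each hold about $1/2$.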

1

Softmax got its name from being a “soft” max (or better, argmax) function. That is, unlike a regular argmax function, which assigns 1 to the maximum element in a vector and 0 to the rest,

$\begin{pmatrix} 1 \\ 6 \\ 2 \end{pmatrix} \to \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} $

the softmax will assign a high value to the maximum number, and small values for the rest:

$\begin{pmatrix} 1 \\ 6 \\ 2 \end{pmatrix} \to \begin{pmatrix} \frac{e^1}{e^1+e^6+e^2} \\ \frac{e^6}{e^1+e^6+e^2} \\ \frac{e^2}{e^1+e^6+e^2} \end{pmatrix} \approx \begin{pmatrix} 0.007 \\ 0.976 \\ 0.018 \end{pmatrix} $
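The same comparison can be reproduced in a few lines of Python. The function names `hardmax` and `softmax` are assumptions for this sketch, as is the choice to break ties in `hardmax` by taking the first maximal entry. Outputs are rounded to three decimals.

```python
import math

def hardmax(z):
    """One-hot argmax: 1 at the maximum, 0 elsewhere (first maximum wins ties)."""
    i = z.index(max(z))
    return [1 if k == i else 0 for k in range(len(z))]

def softmax(z):
    """e^{z_j} / sum_k e^{z_k}, shifted by max(z) for numerical safety."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1, 6, 2]
print(hardmax(z))                         # [0, 1, 0]
print([round(p, 3) for p in softmax(z)])  # [0.007, 0.976, 0.018]
```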