While trying to understand softmax well enough to implement it, I found
this to be the most illuminating description so far. It explains how
softmax generalizes the logistic sigmoid of a binary unit to K
alternative states. The equations may not display well in plain-text
email; the typeset versions are in the linked document.
13.1 Softmax and multinomial units
For a binary unit, the probability of turning on is given by the
logistic sigmoid function of its total input, x.
    p = \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + e^{0}}    (15)
The energy contributed by the unit is -x if it is on and 0 if it is
off. Equation 15 makes it clear that
the probability of each of the two possible states is proportional to
the negative exponential of its
energy. This can be generalized to K alternative states.
    p_j = \frac{e^{x_j}}{\sum_{i=1}^{K} e^{x_i}}    (16)
This is often called a "softmax" unit. It is the appropriate way to
deal with a quantity that has
K alternative values which are not ordered in any way. A softmax can
be viewed as a set of binary
units whose states are mutually constrained so that exactly one of the
K states has value 1 and the
rest have value 0. When viewed in this way, the learning rule for the
binary units in a softmax is
identical to the rule for standard binary units. The only difference
is in the way the probabilities of
the states are computed and the samples are taken.
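
Since my goal was to implement this, here is a minimal sketch of how I
read equations 15 and 16, in plain Python. The function names (sigmoid,
softmax, sample_one_hot) are my own and not from the guide, and the
max-subtraction trick is a standard numerical safeguard that the guide
does not mention. The last lines check that a binary unit is just the
K = 2 case with total inputs x (on) and 0 (off).

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, as in equation 15."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Equation 16: p_j = exp(x_j) / sum_i exp(x_i)."""
    # Assumption (not in the guide): subtract the max before exponentiating.
    # The shift cancels in the ratio, so the result is unchanged, but it
    # keeps exp() from overflowing for large inputs.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng=random):
    """Sample a state vector in which exactly one of the K units is 1."""
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == k else 0 for i in range(len(probs))]


# A binary unit as a two-state softmax: inputs x (on) and 0 (off).
x = 0.7
p_on = softmax([x, 0.0])[0]
print(abs(p_on - sigmoid(x)) < 1e-12)

# A three-state softmax unit and one sampled state.
probs = softmax([1.0, 2.0, 3.0])
print(probs, sample_one_hot(probs))
```

The sampled state is random, so the one-hot vector differs from run to
run; only the constraint that exactly one entry is 1 is guaranteed.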
A further generalization of the softmax unit is to sample N times
(with replacement) from the
probability distribution instead of just sampling once. The K
different states can then have integer
values bigger than 1, but the values must add to N. This is called a
multinomial unit and, again, the
learning rule is unchanged.
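
The multinomial unit can be sketched the same way: draw N samples with
replacement from the softmax distribution and count how often each
state comes up. Again, this is only my reading of the paragraph above;
sample_multinomial_unit is a hypothetical name, and softmax is repeated
here so the snippet runs on its own.

```python
import math
import random


def softmax(xs):
    """Equation 16, with the usual max-shift for numerical safety."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_multinomial_unit(probs, n, rng=random):
    """Draw n samples with replacement; return the count for each state."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[k] += 1
    return counts


probs = softmax([0.5, 1.5, -1.0])
counts = sample_multinomial_unit(probs, n=10)
print(counts, sum(counts))  # the counts always add up to n
```

With n = 1 this reduces to the one-hot softmax unit, which matches the
guide's description of the multinomial unit as a further generalization.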
http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf