core of softmax

Joe ODonnell

Nov 8, 2011, 3:08:49 PM
to SmartTypes: a tool for social discovery
In trying to understand softmax well enough to implement it, I think
this is the most illuminating description I have found so far. It
explains how softmax is a generalization of logistic regression. The
equations can also be seen in the linked document.

13.1 Softmax and multinomial units
For a binary unit, the probability of turning on is given by the
logistic sigmoid function of its total
input, x.
$$p = \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + e^{0}} \qquad (15)$$
The energy contributed by the unit is -x if it is on and 0 if it is
off. Equation 15 makes it clear that
the probability of each of the two possible states is proportional to
the negative exponential of its
energy. This can be generalized to K alternative states.
$$p_j = \frac{e^{x_j}}{\sum_{i=1}^{K} e^{x_i}} \qquad (16)$$
This is often called a "softmax" unit. It is the appropriate way to
deal with a quantity that has
K alternative values which are not ordered in any way. A softmax can
be viewed as a set of binary
units whose states are mutually constrained so that exactly one of the
K states has value 1 and the
rest have value 0. When viewed in this way, the learning rule for the
binary units in a softmax is
identical to the rule for standard binary units. The only difference
is in the way the probabilities of
the states are computed and the samples are taken.
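
To make the link to logistic regression concrete, here is a minimal Python sketch of equations 15 and 16. The function names (sigmoid, softmax, sample_one_hot) are my own and not from the linked document, and subtracting the maximum input before exponentiating is a numerical-stability step I added; it does not change the probabilities. It only covers computing the probabilities and drawing a sample, not the learning rule.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, as in equation 15."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Softmax over the total inputs xs, as in equation 16.

    Subtracting max(xs) is an added stability step (an assumption of this
    sketch, not part of the quoted text); the result is unchanged.
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng=random):
    """Pick one of the K states and return it as a one-hot list."""
    j = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == j else 0 for i in range(len(probs))]


# Two states with inputs x ("on") and 0 ("off") reproduce the binary unit.
x = 1.5
p_on, p_off = softmax([x, 0.0])

# A softmax unit with K = 3 states; exactly one state is set to 1.
probs = softmax([0.1, 2.0, -1.0])
state = sample_one_hot(probs)
```

With K = 2 and inputs x and 0, p_on should match sigmoid(x) up to floating-point error, which is the sense in which softmax generalizes the binary case.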
A further generalization of the softmax unit is to sample N times
(with replacement) from the
probability distribution instead of just sampling once. The K
different states can then have integer
values bigger than 1, but the values must add to N. This is called a
multinomial unit and, again, the
learning rule is unchanged.
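
The multinomial unit can be sketched the same way: draw N samples with replacement from the softmax distribution and count how often each state comes up. Again, sample_multinomial_unit and its parameters are names I made up for illustration, and this is only a sketch of the sampling step under that reading of the text.

```python
import math
import random


def softmax(xs):
    """Softmax over the total inputs xs (max subtracted for stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_multinomial_unit(xs, n, rng=random):
    """Sample n times with replacement and return per-state counts.

    The counts are non-negative integers that add up to n. With n = 1
    this reduces to a single one-hot softmax sample.
    """
    probs = softmax(xs)
    k = len(probs)
    counts = [0] * k
    for j in rng.choices(range(k), weights=probs, k=n):
        counts[j] += 1
    return counts


# K = 3 states, N = 10 draws.
counts = sample_multinomial_unit([0.5, 1.0, -0.3], n=10)
single = sample_multinomial_unit([0.5, 1.0, -0.3], n=1)
```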

http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
