INTRODUCTION
The text below contains first draft notes for a lecture on the basic
strategy behind backpropagation learning.
Comments and corrections are welcome.
To limit the number of subscripts, only one output layer neuron, one
hidden layer (with H nodes) and one training input vector (with I
components) are considered. This excludes consideration of batch
learning, which in many cases is superior.
Zero mean inputs and tanh sigmoid hidden node activation units are
usually recommended for fast stable learning. However, for tutorial
purposes, logistic sigmoid activation units are used below for both
layers. Therefore
s(t) = 1 / ( 1 + exp(-t) )
ds/dt = s'(t) = s *(1-s)
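The identity s' = s*(1-s) can be checked numerically. The following is
a minimal Python sketch; the function names s and ds and the test point
t = 0.7 are illustrative choices, not part of the notes.

```python
import math

def s(t):
    """Logistic sigmoid s(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def ds(t):
    """Derivative of the sigmoid, using the identity s' = s * (1 - s)."""
    st = s(t)
    return st * (1.0 - st)

# Compare the identity against a central finite difference at one point.
t = 0.7
step = 1e-6
numeric = (s(t + step) - s(t - step)) / (2.0 * step)
print(abs(numeric - ds(t)) < 1e-8)
```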
For a desired output y and an applied input training vector
x = (x1, x2, ..., xI), the governing equations are
1. Net input to the jth hidden node
uj = SUM( i = 0, I ){ w1ji *xi }
2. Output of the jth hidden node
hj = s(uj)
3. Net input to the output node
v = SUM( j = 0, H ){ w2j *hj }
4. Output of the output node
z = s(v)
5. Output node error
e = z - y
6. Output node squared error
SE = e^2
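Equations 1 to 6 can be sketched as a single forward pass. This is an
illustrative Python version only: the function name forward, the list
layout (w1 as H rows of I+1 weights with the bias first, w2 as H+1
weights with the bias first) and the sample numbers are assumptions
made for the example.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, y, w1, w2):
    """Evaluate equations 1-6 for one input vector x and desired output y.

    w1[j-1][i] plays the role of w1ji (i = 0 is the bias w1j0).
    w2[j] plays the role of w2j (j = 0 is the output bias w20).
    """
    xb = [1.0] + list(x)                                   # x0 == 1
    u = [sum(w * xi for w, xi in zip(row, xb)) for row in w1]
    h = [1.0] + [s(uj) for uj in u]                        # h0 == 1
    v = sum(w * hj for w, hj in zip(w2, h))
    z = s(v)
    e = z - y
    return h, z, e, e * e

# Hypothetical example with I = 2 inputs and H = 2 hidden nodes.
h, z, e, se = forward([0.5, -1.0], 1.0,
                      [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]],
                      [0.05, 0.3, -0.4])
print(z, e, se)
```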
The training strategy for changing weights to minimize the nonnegative
squared error SE is to choose w1ji and w2j so that
SE >= 0 ==>
d(SE)/d(w1ji) <= 0,
and
d(SE)/d(w2j) <= 0.
The following weight changes are sufficient for that purpose:
Dw1ji = - eta1 *e *w2j *xi, (0 < eta1 < 1)
and
Dw2j = - eta2 *e *hj, (0 < eta2 < 1),
where eta1 and eta2 are empirical learning rates.
The term backpropagation is used to emphasize the fact that
the error in the output layer, e, is used to modify the hidden node
weights w1ji instead of trying to use the hidden node errors which
are unknown. This is made possible by a straightforward application
of the derivative chain rule.
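The two update rules can be tried out on a toy problem. The sketch
below is illustrative only: the name train_step, the learning rates of
0.5 and the sample weights are assumptions, and the hidden layer update
is assumed to use the w2j values from before the output layer update.
It repeats the update for a single training vector and records SE
before each step.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def train_step(x, y, w1, w2, eta1=0.5, eta2=0.5):
    """Apply Dw1ji = -eta1*e*w2j*xi and Dw2j = -eta2*e*hj once.

    Returns the new weights and the squared error measured before the update.
    """
    xb = [1.0] + list(x)                                   # x0 == 1
    h = [1.0] + [s(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
    z = s(sum(w * hj for w, hj in zip(w2, h)))
    e = z - y
    # Row index j of w1 corresponds to hidden node j + 1, hence w2[j + 1].
    new_w1 = [[w1[j][i] - eta1 * e * w2[j + 1] * xb[i]
               for i in range(len(xb))]
              for j in range(len(w1))]
    new_w2 = [w2[j] - eta2 * e * h[j] for j in range(len(w2))]
    return new_w1, new_w2, e * e

x, y = [0.5, -1.0], 1.0
w1 = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
w2 = [0.05, 0.3, -0.4]
history = []
for _ in range(10):
    w1, w2, se = train_step(x, y, w1, w2)
    history.append(se)
print(history)
```

In this small example the recorded squared error shrinks at every step.
That is a property of this example, not a general guarantee for
arbitrary learning rates and targets.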
OUTPUT LAYER TRAINING
Net input to the output layer neuron
v = SUM( j = 0, H ){ w2j *hj }
v Net input to the output neuron
H Number of hidden neurons
hj Output of the jth hidden neuron
( j = 1, 2 ...H )
w2j Weight applied to hj
h0 Output of a constant bias node
h0 == 1
w20 Output bias ( weight )
Change in v due to a change in w2j
Dv = ( dv/dw2j ) *Dw2j = hj *Dw2j
Output of the output neuron
z = s(v)
Change in output due to a change in v
Dz = ( dz/dv ) *Dv
= s'(v) *Dv
= z *( 1 - z ) *Dv
Change in output due to a change in w2j
Dz = ( dz/dw2j ) *Dw2j
= ( dz/dv ) *(dv/dw2j) *Dw2j
= z *( 1 - z) *hj *Dw2j
Output Error
e = z - y Output error
SE = e^2 Squared Error
Change in SE due to a change in w2j
DSE = ( dSE/dw2j ) *Dw2j
= 2 *e *( dz/dw2j ) *Dw2j
= 2 *e *z*( 1 - z ) *hj *Dw2j
Error Minimization Strategy
Dw2j = - eta2 *e *hj, (0 < eta2 < 1)
DSE = -2 *eta2 *z*(1-z) *(e*hj)^2 <= 0
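The first-order estimate of DSE can be compared with the actual change
in SE after a small step in a single weight w2j. The numbers below
(h, w2, y, eta2 = 0.01, j = 1) are made up for illustration, and the
helper name squared_error is hypothetical.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def squared_error(w2, h, y):
    """Return the output z and the squared error SE for fixed hidden outputs h."""
    z = s(sum(w * hj for w, hj in zip(w2, h)))
    return z, (z - y) ** 2

h = [1.0, 0.6, 0.4]       # h0 == 1 is the bias node
w2 = [0.1, 0.3, -0.2]
y = 1.0
eta2 = 0.01
j = 1                      # change only w2j for this j

z, se_before = squared_error(w2, h, y)
e = z - y
dw2j = -eta2 * e * h[j]
# First-order estimate: DSE = -2*eta2*z*(1-z)*(e*hj)^2
predicted = -2.0 * eta2 * z * (1.0 - z) * (e * h[j]) ** 2

w2_new = list(w2)
w2_new[j] += dw2j
_, se_after = squared_error(w2_new, h, y)
actual = se_after - se_before
print(predicted, actual)
```

For a small eta2 the two values should agree closely, with the
difference coming from the second-order terms that the linearized
estimate ignores.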
HIDDEN LAYER TRAINING
Net input to the jth hidden layer neuron
uj = SUM( i = 0, I ){ w1ji *xi }
uj Net input to the jth hidden neuron
I Number of input fan-in units
xi Output of the ith fan-in unit
( i = 1, 2 ...I )
w1ji Weight applied to xi
x0 Output of a constant bias node
x0 == 1
w1j0 Bias for the jth hidden node
Change in uj due to a change in w1ji
Duj = ( duj/dw1ji ) *Dw1ji = xi *Dw1ji
Output of the jth hidden neuron
hj = s(uj)
Change in hj due to a change in uj
Dhj = ( dhj/duj ) *Duj
= s'(uj) *Duj
= hj *( 1 - hj ) *Duj
Change in hj due to a change in w1ji
Dhj = ( dhj/dw1ji ) *Dw1ji
= ( dhj/duj ) *(duj/dw1ji) *Dw1ji
= hj * ( 1 - hj ) * xi *Dw1ji
Output of the output neuron
z = s(v)
v = SUM( j = 0, H ){ w2j *hj }
hj = s(uj)
uj = SUM( i = 0, I ){ w1ji *xi }
Change in output due to a change in w1ji
Dz = ( dz/dw1ji ) *Dw1ji
= ( dz/dv ) *(dv/dhj) *(dhj/dw1ji) * Dw1ji
= z *( 1 - z) *w2j *hj *(1 - hj ) *xi *Dw1ji
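The chain rule product for dz/dw1ji can be checked against a finite
difference. This is a sketch under assumed values: the function name
output, the weights and the choice j = 1, i = 1 are illustrative only.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def output(x, w1, w2):
    """Return the augmented input xb, the hidden outputs h and the output z."""
    xb = [1.0] + list(x)                                   # x0 == 1
    h = [1.0] + [s(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
    z = s(sum(w * hj for w, hj in zip(w2, h)))
    return xb, h, z

x = [0.5, -1.0]
w1 = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]   # row j-1 holds w1j0, w1j1, w1j2
w2 = [0.05, 0.3, -0.4]
j, i = 1, 1

xb, h, z = output(x, w1, w2)
# Chain rule: dz/dw1ji = z*(1-z) * w2j * hj*(1-hj) * xi
analytic = z * (1.0 - z) * w2[j] * h[j] * (1.0 - h[j]) * xb[i]

# Central finite difference in the single weight w1ji.
step = 1e-6
w1_plus = [row[:] for row in w1]
w1_minus = [row[:] for row in w1]
w1_plus[j - 1][i] += step
w1_minus[j - 1][i] -= step
numeric = (output(x, w1_plus, w2)[2] - output(x, w1_minus, w2)[2]) / (2.0 * step)
print(analytic, numeric)
```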
Output error
e = z - y Output error
SE = e^2 Squared Error
Change in SE due to a change in w1ji
DSE = ( dSE/dw1ji ) * Dw1ji
= 2 *e *(dz/dw1ji) * Dw1ji
= 2 *e *z*( 1 - z ) *w2j *hj *(1 - hj ) *xi *Dw1ji
Error Minimization Strategy
Dw1ji = - eta1 *e *w2j *xi , ( 0 < eta1 < 1)
DSE = -2 *eta1 *z*( 1 - z ) *hj *(1 - hj ) *( e *w2j *xi )^2 <= 0
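As with the output layer, the first-order estimate of DSE for a single
hidden weight can be compared with the actual change in SE. The sketch
below uses assumed values (eta1 = 0.01, j = 1, i = 1) and a hypothetical
helper named forward.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, y, w1, w2):
    """Return the augmented input xb, hidden outputs h, output z, error e and SE."""
    xb = [1.0] + list(x)                                   # x0 == 1
    h = [1.0] + [s(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
    z = s(sum(w * hj for w, hj in zip(w2, h)))
    e = z - y
    return xb, h, z, e, e * e

x, y = [0.5, -1.0], 1.0
w1 = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]   # row j-1 holds w1j0, w1j1, w1j2
w2 = [0.05, 0.3, -0.4]
eta1 = 0.01
j, i = 1, 1                                  # change only w1ji

xb, h, z, e, se_before = forward(x, y, w1, w2)
dw1ji = -eta1 * e * w2[j] * xb[i]
# First-order estimate: DSE = -2*eta1*z*(1-z)*hj*(1-hj)*(e*w2j*xi)^2
predicted = (-2.0 * eta1 * z * (1.0 - z) * h[j] * (1.0 - h[j])
             * (e * w2[j] * xb[i]) ** 2)

w1_new = [row[:] for row in w1]
w1_new[j - 1][i] += dw1ji
se_after = forward(x, y, w1_new, w2)[4]
actual = se_after - se_before
print(predicted, actual)
```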
Hope this helps.
Greg
SE >= 0 ==> d(SE)/d(w2j) <= 0,
then choose w1ji so that
SE >= 0 ==> d(SE)/d(w1ji) <= 0,
The following weight changes are sufficient for that purpose:
Dw2j = - eta2 *e *hj, (0 < eta2 < 1),
then
Dw1ji = - eta1 *e *w2j *xi, (0 < eta1 < 1)
where eta1 and eta2 are empirical learning rates.
The term backpropagation is used to emphasize the fact that the error
in the output layer, e, and the output layer weight, w2j, are used
1. Net input to the jth hidden node ( j = 1, 2, ... H )
Hope this helps.
Greg-
I'm missing an explanation of what "w2j", aka the "output layer weight", is. I only know about weights on connections leading from one neuron to another, so I would expect w2_ji to be the weight of the connection from neuron i to neuron j. Good guess?
Ondra
I limited the output nodes to 1 so that I could use w2j instead of w2jk.
Usually I use i for input nodes (fan-in units, not neurons), j for
hidden nodes (neurons) and k for output nodes (neurons).
Hope this helps.
Greg