Basic Backpropagation Learning Strategy (Tutorial)

Greg Heath

Apr 19, 2008, 2:32:49 AM

BASIC BACKPROPAGATION LEARNING STRATEGY

INTRODUCTION

The text below contains first draft notes for a lecture on the basic
strategy behind backpropagation learning.

Comments and corrections are welcome.

To limit the number of subscripts, only one output layer neuron, one
hidden layer (with H nodes) and one training input vector (with I
components) are considered. This excludes consideration of batch
learning, which in many cases is superior.

Zero mean inputs and tanh sigmoid hidden node activation units are
usually recommended for fast stable learning. However, for tutorial
purposes, logistic sigmoid activation units are used below for both
layers. Therefore

s(t) = 1 / ( 1 + exp(-t) )
ds/dt = s'(t) = s *(1-s)
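
As a quick sanity check of the identity s'(t) = s *(1-s), here is a
minimal Python sketch that compares it with a central finite
difference. The function names and the step size are illustrative
assumptions, not part of the original notes.

```python
import math

def s(t):
    """Logistic sigmoid s(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def ds(t):
    """Derivative written in terms of the output: s'(t) = s*(1 - s)."""
    st = s(t)
    return st * (1.0 - st)

def ds_numeric(t, step=1e-6):
    """Central finite-difference estimate of s'(t); step is an arbitrary choice."""
    return (s(t + step) - s(t - step)) / (2.0 * step)

# Print the closed form next to the numerical estimate at a few points.
for t in (-2.0, 0.0, 0.5, 3.0):
    print(t, ds(t), ds_numeric(t))
```

The two columns should agree to many decimal places, which is why the
derivations below can replace s'(v) by z *(1-z) and s'(uj) by
hj *(1-hj).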

The governing equations for the network output z produced by the
applied input training vector x = (x1, x2, ...xI), and for its error
relative to the desired output y, are

1. Net input to the jth hidden node

uj = SUM( i = 0, I ){ w1ji *xi }

2. Output of the jth hidden node

hj = s(uj)

3. Net input to the output node

v = SUM( j = 0, H ){ w2j *hj }

4. Output of the output node

z = s(v)

5. Output node error

e = z - y

6. Output node squared error

SE = e^2
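
The six steps above can be sketched as a single forward pass in
Python. This is only an illustration under stated assumptions: the
list layout (w1[j-1][i] for w1ji, w2[j] for w2j), the function name
forward and the sample numbers are mine, not part of the notes.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(w1, w2, x, y):
    """Forward pass for one input vector and one output neuron.

    w1[j-1][i] plays the role of w1ji (j = 1..H, i = 0..I; i = 0 is the bias).
    w2[j] plays the role of w2j (j = 0..H; j = 0 is the output bias).
    x holds x1..xI and y is the desired output.
    """
    xb = [1.0] + list(x)                                            # x0 == 1
    u = [sum(wji * xi for wji, xi in zip(row, xb)) for row in w1]   # step 1
    h = [1.0] + [s(uj) for uj in u]                                 # step 2, h0 == 1
    v = sum(w2j * hj for w2j, hj in zip(w2, h))                     # step 3
    z = s(v)                                                        # step 4
    e = z - y                                                       # step 5
    return h, z, e, e * e                                           # step 6: SE = e^2

# Hypothetical network with I = 2 inputs and H = 2 hidden nodes.
w1 = [[0.1, 0.4, -0.3], [-0.2, 0.3, 0.5]]
w2 = [0.05, 0.6, -0.4]
print(forward(w1, w2, [1.0, -1.0], 0.9))
```

With all weights set to zero every net input is 0, so each hidden
output and the network output equal s(0) = 0.5 regardless of x.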

The training strategy for changing weights to minimize the nonnegative
squared error SE is to choose the weight changes Dw1ji and Dw2j so that
the resulting first-order change in SE is never positive:

DSE = ( dSE/dw1ji ) *Dw1ji <= 0,
and
DSE = ( dSE/dw2j ) *Dw2j <= 0.

The following weight changes are sufficient for that purpose:

Dw1ji = - eta1 *e *w2j *xi, (0 < eta1 < 1)
and
Dw2j = - eta2 *e *hj, (0 < eta2 < 1),

where eta1 and eta2 are empirical learning rates.

The term backpropagation is used to emphasize the fact that
the error in the output layer, e, is used to modify the hidden node
weights w1ji instead of trying to use the hidden node errors which
are unknown. This is made possible by a straightforward application
of the derivative chain rule.
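
To make the two update rules concrete, here is a hedged Python sketch
that applies them repeatedly to a single training pair. The function
name train_step, the starting weights, the learning rates and the
number of steps are all illustrative assumptions. Both changes are
computed from the same forward pass (the hidden layer update uses the
value of w2j from before the output layer update), which is a choice
made for this sketch.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def train_step(w1, w2, x, y, eta1=0.5, eta2=0.5):
    """Apply Dw1ji and Dw2j once, in place; return SE measured before the update."""
    xb = [1.0] + list(x)                                              # x0 == 1
    h = [1.0] + [s(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
    z = s(sum(w2j * hj for w2j, hj in zip(w2, h)))
    e = z - y
    # Hidden weights: only the output error e and the output weight w2j are needed.
    for j, row in enumerate(w1, start=1):
        for i, xi in enumerate(xb):
            row[i] += -eta1 * e * w2[j] * xi      # Dw1ji = -eta1*e*w2j*xi
    # Output weights, including the bias weight w20 (h0 == 1).
    for j, hj in enumerate(h):
        w2[j] += -eta2 * e * hj                   # Dw2j = -eta2*e*hj
    return e * e

w1 = [[0.1, 0.4, -0.3], [-0.2, 0.3, 0.5]]   # hypothetical starting weights
w2 = [0.05, 0.6, -0.4]
x, y = [1.0, -1.0], 0.9

history = [train_step(w1, w2, x, y) for _ in range(100)]
print("first SE:", history[0], "last SE:", history[-1])
```

For this one-pattern example the squared error shrinks as the steps
are repeated. That is not a general guarantee: the inequalities
DSE <= 0 derived below are first-order statements, so a learning rate
that is too large can still overshoot.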

OUTPUT LAYER TRAINING

Net input to the output layer neuron

v = SUM( j = 0, H ){ w2j *hj }

v      Net input to the output neuron
H      Number of hidden neurons
hj     Output of the jth hidden neuron ( j = 1, 2, ...H )
w2j    Weight applied to hj
h0     Output of a constant bias node ( h0 == 1 )
w20    Output bias ( weight )

Change in v due to a change in w2j

Dv = ( dv/dw2j ) *Dw2j = hj *Dw2j

Output of the output neuron

z = s(v)

Change in output due to a change in v

Dz = ( dz/dv ) *Dv
= s'(v) *Dv
= z *( 1 - z ) *Dv

Change in output due to a change in w2j

Dz = ( dz/dw2j ) *Dw2j
= ( dz/dv ) *(dv/dw2j) *Dw2j
= z *( 1 - z) *hj *Dw2j

Output Error

e  = z - y      ( output error )
SE = e^2        ( squared error )

Change in SE due to a change in w2j

DSE = ( dSE/dw2j ) *Dw2j

= 2 *e *( dz/dw2j ) *Dw2j
= 2 *e *z*( 1 - z ) *hj *Dw2j

Error Minimization Strategy

Dw2j = - eta2 *e *hj, (0 < eta2 < 1)

DSE = -2 *eta2 *z*(1-z) *(e*hj)^2 <= 0
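
The output layer result can be checked numerically. The sketch below
is a minimal illustration, assuming made-up values for hj, w2j, y and
eta2: it compares dSE/dw2j = 2 *e *z*(1-z) *hj with a finite
difference and evaluates the first-order DSE for the proposed Dw2j.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def squared_error(w2, h, y):
    """SE = (s(v) - y)^2 with v = SUM_j w2j*hj; h[0] == 1 is the bias node."""
    z = s(sum(w2j * hj for w2j, hj in zip(w2, h)))
    return (z - y) ** 2

h = [1.0, 0.7, 0.2]        # h0, h1, h2 (hypothetical hidden outputs)
w2 = [0.1, -0.5, 0.8]      # w20, w21, w22 (hypothetical weights)
y, eta2, step = 1.0, 0.5, 1e-6

z = s(sum(w2j * hj for w2j, hj in zip(w2, h)))
e = z - y

rows = []
for j, hj in enumerate(h):
    analytic = 2.0 * e * z * (1.0 - z) * hj                # dSE/dw2j from the text
    up = [w + step if k == j else w for k, w in enumerate(w2)]
    dn = [w - step if k == j else w for k, w in enumerate(w2)]
    numeric = (squared_error(up, h, y) - squared_error(dn, h, y)) / (2.0 * step)
    dw = -eta2 * e * hj                                    # Dw2j
    dse = -2.0 * eta2 * z * (1.0 - z) * (e * hj) ** 2      # first-order DSE
    rows.append((j, analytic, numeric, dw, dse))
    print(j, analytic, numeric, dw, dse)
```

Since 0 < z < 1, the factor z*(1-z) is positive, so dropping it from
Dw2j changes the size of the step but not its sign. That is why the
simplified rule is still sufficient for DSE <= 0.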

HIDDEN LAYER TRAINING

Net input to the jth hidden layer neuron

uj = SUM( i = 0, I ){ w1ji *xi }

uj     Net input to the jth hidden neuron
I      Number of input fan-in units
xi     Output of the ith fan-in unit ( i = 1, 2, ...I )
w1ji   Weight applied to xi
x0     Output of a constant bias node ( x0 == 1 )
w1j0   Bias for the jth hidden node

Change in uj due to a change in w1ji

Duj = ( duj/dw1ji ) *Dw1ji = xi *Dw1ji

Output of the jth hidden neuron

hj = s(uj)

Change in hj due to a change in uj

Dhj = ( dhj/duj ) *Duj
= s'(uj) *Duj
= hj *( 1 - hj ) *Duj

Change in hj due to a change in w1ji

Dhj = ( dhj/dw1ji ) *Dw1ji
= ( dhj/duj ) *(duj/dw1ji) *Dw1ji
= hj * ( 1 - hj ) * xi *Dw1ji

Output of the output neuron

z = s(v)

v = SUM( j = 0, H ){ w2j *hj }
hj = s(uj)
uj = SUM( i = 0, I ){ w1ji *xi }

Change in output due to a change in w1ji

Dz = ( dz/dw1ji ) *Dw1ji
= ( dz/dv ) *(dv/dhj) *(dhj/dw1ji) * Dw1ji
= z *( 1 - z) *w2j *hj *(1 - hj ) *xi *Dw1ji

Output error

e  = z - y      ( output error )
SE = e^2        ( squared error )

Change in SE due to a change in w1ji

DSE = ( dSE/dw1ji ) * Dw1ji

= 2 *e *(dz/dw1ji) * Dw1ji

= 2 *e *z*( 1 - z ) *w2j *hj *(1 - hj ) *xi *Dw1ji

Error Minimization Strategy

Dw1ji = - eta1 *e *w2j *xi , ( 0 < eta1 < 1)

DSE = -2 *eta1 *z*( 1 - z ) *hj *(1 - hj ) *( e *w2j *xi )^2 <= 0
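
The hidden layer chain rule can be checked the same way. The sketch
below is illustrative only: the weights, inputs, target and eta1 are
made-up values, and the helper names (forward, squared_error, bumped)
are mine. It compares dSE/dw1ji = 2 *e *z*(1-z) *w2j *hj *(1-hj) *xi
with a finite difference and evaluates the first-order DSE for the
proposed Dw1ji.

```python
import math

def s(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(w1, w2, xb):
    """Return hidden outputs h (with h0 == 1) and the network output z."""
    h = [1.0] + [s(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
    z = s(sum(w2j * hj for w2j, hj in zip(w2, h)))
    return h, z

def squared_error(w1, w2, xb, y):
    return (forward(w1, w2, xb)[1] - y) ** 2

def bumped(w1, j, i, delta):
    """Copy of w1 with delta added to w1ji (rows are j = 1..H)."""
    copy = [row[:] for row in w1]
    copy[j - 1][i] += delta
    return copy

xb = [1.0, 0.5, -1.5]                        # x0 == 1, x1, x2 (hypothetical)
w1 = [[0.2, -0.4, 0.1], [-0.3, 0.6, 0.5]]    # w1ji, rows j = 1, 2
w2 = [0.1, -0.5, 0.8]                        # w20, w21, w22
y, eta1, step = 1.0, 0.5, 1e-6

h, z = forward(w1, w2, xb)
e = z - y

rows = []
for j in (1, 2):
    for i, xi in enumerate(xb):
        dz = z * (1 - z) * w2[j] * h[j] * (1 - h[j]) * xi        # dz/dw1ji
        analytic = 2.0 * e * dz                                  # dSE/dw1ji
        numeric = (squared_error(bumped(w1, j, i, step), w2, xb, y)
                   - squared_error(bumped(w1, j, i, -step), w2, xb, y)) / (2.0 * step)
        dw = -eta1 * e * w2[j] * xi                              # Dw1ji
        dse = -2.0 * eta1 * z * (1 - z) * h[j] * (1 - h[j]) * (e * w2[j] * xi) ** 2
        rows.append((j, i, analytic, numeric, dw, dse))
        print(j, i, analytic, numeric, dw, dse)
```

Note that the update for w1ji uses only the output error e and the
output weight w2j. No hidden node error appears anywhere, which is the
point of the term backpropagation.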

Hope this helps.

Greg

Greg Heath

Apr 19, 2008, 9:28:05 AM

On Apr 19, 2:32 am, Greg Heath <he...@alumni.brown.edu> wrote:
> BASIC BACKPROPAGATION LEARNING STRATEGY

The training strategy for changing weights to minimize the nonnegative
squared error SE is to first choose the change Dw2j so that

DSE = ( dSE/dw2j ) *Dw2j <= 0,

then choose the change Dw1ji so that

DSE = ( dSE/dw1ji ) *Dw1ji <= 0.

The following weight changes are sufficient for that purpose:

Dw2j = - eta2 *e *hj, (0 < eta2 < 1),

then

Dw1ji = - eta1 *e *w2j *xi, (0 < eta1 < 1)

where eta1 and eta2 are empirical learning rates.

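The term backpropagation is used to emphasize the fact that the
error in the output layer, e, and the output layer weight, w2j, are
used to modify the hidden node weights w1ji instead of trying to
use the hidden node errors which are unknown. This is made
possible by a straightforward application of the derivative chain
rule.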

Greg Heath

Apr 20, 2008, 12:15:25 PM

On Apr 19, 9:28 am, Greg Heath <he...@alumni.brown.edu> wrote:
> On Apr 19, 2:32 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > BASIC BACKPROPAGATION LEARNING STRATEGY
>
> > INTRODUCTION
>
> > The text below contains first draft notes for a lecture on the basic
> > strategy behind backpropagation learning.
>
> > Comments and corrections are welcome.
>
> > To limit the number of subscripts, only one output layer neuron, one
> > hidden layer (with H nodes) and one training input vector (with I
> > components ) are considered. This excludes consideration of batch
> > learning which in many cases is superior.
>
> > Zero mean inputs and tanh sigmoid hidden node activation units are
> > usually recommended for fast stable learning. However, for tutorial
> > purposes, logistic sigmoid activation units are used below for both
> > layers. Therefore
>
> > s(t) = 1 / ( 1 + exp(-t) )
> > ds/dt = s'(t) = s *(1-s)
>
> > The governing equations for the desired output y resulting from the
> > applied input training vector x = (x1, x2, ...xI) are

1. Net input to the jth hidden node ( j = 1, 2, ... H )

( j = 1, 2, ...H )

Hope this helps.

Greg

Ondra Zizka

Apr 21, 2008, 2:42:03 AM

> The following weight changes are sufficient for that purpose:
>
> Dw2j = - eta2 *e *hj, (0 < eta2 < 1),
>
> then
>
> Dw1ji = - eta1 *e *w2j *xi, (0 < eta1 < 1)
>
> where eta1 and eta2 are empirical learning rates.
>

> The term backpropagation is used to emphasize the fact that the
> error in the output layer, e, and the output layer weight, w2j, are
> used to modify the hidden node weights w1ji instead of trying to
> use the hidden node errors which are unknown. This is made
> possible by a straightforward application of the derivative chain
> rule.

I am missing an explanation of what "w2j", aka the "output layer weight", is. I only know about weights on connections leading from one neuron to another, so I would expect w2_ji to be the weight of the connection from neuron i to neuron j. Good guess?

Ondra

Greg Heath

Apr 21, 2008, 1:56:20 PM

I limited the output nodes to 1 so that I could use w2j instead of
w2jk. Usually I use i for input nodes (fan-in units, not neurons), j
for hidden nodes (neurons) and k for output nodes (neurons).

Hope this helps.

Greg