
CUDA-based Feed-Forward Neural Net, Training provides poor MSE


Alex Ge

Nov 18, 2015, 11:30:13 PM
Hello,

I've been coding a CUDA-based feed-forward neural network (github.com/alexge233/cuANN.git) because I wanted to learn the ins and outs by implementation, and because I intend to use it for my research (once it works properly and is optimised).

I implemented it using Jeff Heaton's videos (https://www.youtube.com/user/HeatonResearch/videos), specifically the training videos (parts 1, 2, 3 and 4).

It is a very simple feed-forward neural net, fully connected, using the sigmoid activation, mean squared error, and the delta rule with back-propagation.

Assuming I have implemented the code correctly (there is always the possibility I've done something wrong), I am getting poor results, using the XOR problem as my test subject.

When training an XOR network (/samples/xor.cu), the MSE starts at 0.24/0.25 and drops incredibly slowly. Trying another problem (Abalone) instead gives a constant MSE of 0.115216.

I have tested both online and batch training, and have tried to make sure the implementation is correct (I am using FANN as a benchmark and doing the XOR verification on paper). I can't rule a bug out, but I would like to know:

a) The MSE appears to be oscillating. Using learning rate 0.7 and momentum 0.3 as suggested by Heaton, the MSE dips slightly below 0.24 and then rises again.

b) I am not using bias neurons. Are they a requirement? If so, I can make changes.

c) Batch versus online training appears to make little difference. Why is that?

d) Under certain network architectures (e.g., many hidden layers, or a hidden layer with more nodes than the input layer) I get zero gradients (lots of them).
Is this normal (part of gradient descent), or have I indeed got a bug?

Feel free to have a look if you want to; the only requirements are CUDA compute capability 3.0 or higher, Thrust (which ships with CUDA), and a recent compiler.

At the moment, the XOR sample appears to be very slow to learn; for example, using 50000 epochs and batch training, I get:

Epoch 5000 MSE: 0.249934
Epoch 10000 MSE: 0.247509
Epoch 15000 MSE: 0.244783
Epoch 20000 MSE: 0.242798
Epoch 25000 MSE: 0.242017
Epoch 30000 MSE: 0.241144
Epoch 35000 MSE: 0.240172
Epoch 40000 MSE: 0.239828
Epoch 45000 MSE: 0.239733

The Abalone network simply gets stuck at the same MSE:

Epoch 100 MSE: 0.115216
Epoch 200 MSE: 0.115216
Epoch 300 MSE: 0.115216
Epoch 400 MSE: 0.115216


Many thanks in advance to anyone who can help!

Ray

Nov 19, 2015, 3:10:07 AM
Okay, first thing: make sure the random range you're using to initialize weights
is centered on zero. Lots of first-time implementers initialize all their weights
with the same sign, and they get results much like you're describing.

I'd also recommend adjusting your sigmoid function to be sigmoid(x) - 0.5 instead
of sigmoid(x). Your partial derivative in terms of error will be exactly the same
as with the unadjusted sigmoid, but the symmetry around zero will give you
simpler fitness landscapes with fewer "wrinkles" and local minima.

A bias node would still be helpful and make training faster, but if you use about
twice as many nodes in each layer as XOR actually requires, it won't be absolutely
necessary.

Also be careful not to initialize weights to values that are too large. The best
heuristic I know of is a random number between plus and minus sqrt(6)/sqrt(n),
where n is the number of nodes in the previous layer plus the number of nodes
in the subsequent layer.
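
If it helps, here's roughly what that looks like in C (the names and the use of
rand() are mine, just a sketch; do the equivalent wherever you build your weight
arrays before copying them to the device):

#include <stdlib.h>
#include <math.h>

/* Uniform random float in [-limit, +limit]. */
static float uniform_range(float limit) {
    return limit * (2.0f * ((float)rand() / (float)RAND_MAX) - 1.0f);
}

/* Initialize one layer's weights with the sqrt(6)/sqrt(n) heuristic,
   where n = fan_in + fan_out as described above. */
void init_weights(float *w, int fan_in, int fan_out) {
    float limit = sqrtf(6.0f) / sqrtf((float)(fan_in + fan_out));
    for (int i = 0; i < fan_in * fan_out; i++)
        w[i] = uniform_range(limit);
}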

Finally, XOR has only 4 training/test cases. Be sure you're presenting them in
random-ish orders each training epoch, or you can get stuck in cycles: there are
regions of the fitness landscape where the four cases all present different
gradients, and presenting them in the same order just makes the weights cycle.
And there is no need to present every example exactly once every epoch, although
that is nice and tidy. It makes as much sense or more to just present four at
random each epoch; that's more likely to escape local minima.

Alex Ge

Nov 19, 2015, 1:02:35 PM
Ray, thank you very much for the reply!
My weights are uniformly initialised between -0.1 and 0.1.

I will make the changes you suggested:

1) Instead of sigmoid, use sigmoid - 0.5.
2) Randomize order of training data (I have not been doing that)
3) I'll try the sqrt(6)/sqrt(n) formula for the random weight range.
4) Finally, I may have to implement the bias neuron (this is a lot of work).

A few questions for you:

1) My random weights are uniformly distributed. Is that OK, or should I opt for another type of distribution?
2) Should I use tanh instead of sigmoid? I've read a few papers which suggest it's better than sigmoid.
3) I am normalising/scaling my input between 0 and 1. The actual XOR data was -1 to 1. This slightly improved accuracy. Is that common practice?

Many thanks!
Alex

Ray

Nov 20, 2015, 3:00:06 AM
Alex Ge <alex...@hotmail.com> wrote:
>
> 1) My random weights are uniformly distributed. Is that OK, or should I opt for another type of distribution?

Uniform distributions are fine.

> 2) Should I use tanh instead of sigmoid? I've read a few papers which suggest it's better than sigmoid.

Tanh is somewhat more stable and more likely to converge in a controlled way in deeper networks, because its
tails are fatter/heavier than the sigmoid's. But if you're not trying for more than a couple of
hidden layers, sigmoid will usually train a bit faster.

If you're training a really deep (more than 4 hidden layers) or recurrent network though, you can't use either of them;
in a very deep network they'll drive your deeper layers to saturation before your shallower layers stabilize, and in
a recurrent network they just won't stabilize anything because every correction will keep everything changing at
different rates. This happens because both of them have exponentially decaying distributions toward the tails.

For stability and convergence somewhere in the middle of the space where good solutions are more likely, in a large
network (>4 hidden layers) you would wind up using either a nonsigmoid activation function like the ramp function,
or a subexponential sigmoid like the softsign. But with XOR? It's a simple problem, a single hidden layer, and
there is no harm in sticking with the sigmoid. Of course, XOR is usually just making sure your implementation
works before you move on to real problems, so...

Ramp activation function --> (x > 0) ? x : 0

softsign --> x/( |x| + 1 )

You can even use a subhyperbolic sigmoid like the so-called "magic sigmoid" if you can stand how slow it trains.
It's very, very stable. "Magic" can recover from all sorts of weird stuff and work in bizarre neural network
architectures where connections between nodes are completely arbitrary and nobody's heard of layers.

But. It. Is. Slow. And its derivative is ugly. And even if you get it right deeper networks are horribly
prone to overfitting and poor generalization unless you do additional tricks like dropout training.

"magic" --> x/(1 + |x| + sqrt(|x|))

> 3) I am normalising/scaling my input between 0 and 1. The actual XOR data was -1 to 1. This slightly improved accuracy. Is that common practice?

Scale your input (and requested output) on the same scale your activation function returns. Or scale your activation
function to the same scale you want to use for your input.

When your sigmoid was returning 0 to 1, you were getting better accuracy scaling your data between 0 and 1.
Given sigmoid with a -0.5 adjustment, scaling data to between -0.5 and +0.5 is the better choice. If you
want to scale data from -1 to +1, switching to tanh is a good choice. Or you could also double the
output of your (adjusted) logistic sigmoid and its derivative function.
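
That doubling trick gives you what's sometimes called the bipolar sigmoid; a
sketch in C (mathematically it's the same thing as tanh(x/2)):

#include <math.h>

/* Logistic sigmoid rescaled to (-1, 1): 2*sigmoid(x) - 1. */
float bipolar_sigmoid(float x) {
    return 2.0f / (1.0f + expf(-x)) - 1.0f;
}

/* Its derivative, written in terms of the output s = bipolar_sigmoid(x). */
float bipolar_sigmoid_deriv(float s) {
    return 0.5f * (1.0f + s) * (1.0f - s);
}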

Alex Ge

Nov 23, 2015, 11:13:35 PM
Ray thank you very much, you've given me more information than I had hoped for!

I'll try all your suggestions; I'm going to start with shuffling my training data, then changing the activation function, and then trying a bias neuron.

Is `sigmoid(x) - 0.5` what I've seen called the skew or steepness factor?

Most networks I've tested have had between 1 and 4 hidden layers.
In the deeper ones (3 or 4 hidden layers), and possibly because I use CUDA's approximate float math, the gradients were often zero.

Do you know what could cause that?
Bear in mind my understanding of neural networks is limited.

Last but not least, the original XOR data was in the range [-1,1], whereas my sigmoid can only output [0,1]. Is that why it failed to learn? I scaled my input to [0,1] and accuracy slightly improved, but I am guessing the random shuffling will help more.

Alex Ge

Nov 25, 2015, 1:58:56 PM
I've tried the weight init rule (-1/√d, 1/√d), where d is the number of input neurons.
(http://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network)
The output was clearly oscillating:

Epoch 1000 MSE: 0.249641
Epoch 2000 MSE: 0.246542
Epoch 3000 MSE: 0.241256
Epoch 4000 MSE: 0.242956
Epoch 5000 MSE: 0.211567
Epoch 6000 MSE: 0.243413
Epoch 7000 MSE: 0.243224
Epoch 8000 MSE: 0.242227
Epoch 9000 MSE: 0.240674

This is with 0.7 learning rate and 0.3 momentum.

I also followed your advice to add more hidden neurons.
So instead of 2 input, 2 hidden (1 hidden layer), 1 output,
I used 2 input, 4 hidden (2 hidden layers), 1 output.
This was so far the only network that learnt the data-set with a very small MSE:

Epoch 1000 MSE: 0.249503
Epoch 2000 MSE: 0.249162
Epoch 3000 MSE: 0.248534
Epoch 4000 MSE: 0.245994
Epoch 5000 MSE: 0.23465
Epoch 6000 MSE: 0.0855746
Trained Network with MSE: 0.0282527

I was a little suspicious how and why the MSE dropped so fast from 0.23465 to 0.0855... So I re-tried the experiment with more epochs:

Epoch 1000 MSE: 0.249958
Epoch 2000 MSE: 0.249892
Epoch 3000 MSE: 0.249814
Epoch 4000 MSE: 0.249762
Epoch 5000 MSE: 0.249612
Epoch 6000 MSE: 0.24911
Trained Network with MSE: 0.00671475

As you can see, around the 7000 mark it drops a lot!
However, feeding the network input gives a different picture:

input: [1,0]
output: 0.498424 (instead of 1)
input: [0,1]
output: 0.908918 (instead of 1)
input: [0,0]
output: 0.105613 (instead of 0)
input: [1,1]
output: 0.0854661 (instead of 0)

Therefore, the approximation for [1,1], [0,0] and [0,1] is OK, but for some reason [1,0] has trouble.
I am guessing this is because I am not shuffling the training data?
BTW, I haven't added bias neurons yet.

Finally, I tried using `sigmoid(x) - 0.5`, and the network failed to learn; in fact, the MSE oscillated and then sky-rocketed:

Epoch 1000 MSE: 0.500278
Epoch 2000 MSE: 0.500411
Epoch 3000 MSE: 0.500411
Epoch 4000 MSE: 0.500614
Epoch 5000 MSE: 0.500875
Epoch 6000 MSE: 0.501343
Epoch 7000 MSE: 0.503295
Epoch 8000 MSE: 0.504477
Epoch 9000 MSE: 0.509222
Epoch 10000 MSE: 0.294692
Epoch 11000 MSE: 1.1875
Epoch 12000 MSE: 1.1875
Epoch 13000 MSE: 1.1875
Epoch 14000 MSE: 1.1875
Epoch 15000 MSE: 1.1875
Epoch 16000 MSE: 1.1875
Epoch 17000 MSE: 1.1875
Epoch 18000 MSE: 1.1875
Epoch 19000 MSE: 1.1875

This is using the same architecture that learnt the patterns previously.
I'll implement training data random shuffling and see if it helps.

Ray

Nov 25, 2015, 2:30:05 PM

If your output nodes can only produce values in the range [0,1], then just read them through a filter:
multiply by two and subtract one (y' = 2y - 1).


The vanishing gradient is a common problem with networks four or more layers deep. It happens
because the biases at the lower levels (or the weights from inputs to lower levels) reach
extreme values, where the derivative is within a breath of zero, before the nodes closer to
the output are finished adjusting. If you want to train anything four layers deep, definitely
use the softsign sigmoid: it has usable gradients much further out than the logistic sigmoid,
and its softer curve at the higher layers won't push the lower weights as far toward those
extremes in the first place.


alex...@gmail.com

Dec 10, 2015, 12:27:08 PM
Ray, I'm puzzled by two other problems.

MSE:

I've implemented MSE as per Jeff Heaton's video: https://www.youtube.com/watch?v=U4BTzF3Wzt0
However, when I train the XOR network, although the MSE drops very low (say 0.007) when I later test it using the XOR data, it goes up to 0.50!

Even more interestingly, every now and then, during training (back-prop) the MSE will be very low and the training loop will stop, but when I test it by propagating individual input vectors, I can see that it doesn't always work.

How exactly should I be implementing MSE? I mean the actual formula or algorithm. AFAIK it is:

the sum of (target output - actual output)^2 over every pattern and output node, divided by ((# of patterns) * (# of outputs))

So for each output node, I obtain the actual output and the target output, and square the difference. I add up all the squares and then divide by the number of patterns times the number of output nodes.
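
In plain C, what I'm doing amounts to something like this (a host-side sketch with my own names, not the actual cuANN kernel code):

/* Mean squared error over all patterns and all output nodes.
   actual[] and target[] are n_patterns * n_outputs values. */
float mse(const float *actual, const float *target,
          int n_patterns, int n_outputs) {
    float sum = 0.0f;
    for (int i = 0; i < n_patterns * n_outputs; i++) {
        float e = target[i] - actual[i];
        sum += e * e;  /* squaring makes the sign irrelevant */
    }
    return sum / (float)(n_patterns * n_outputs);
}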

I'm contemplating switching to average cross-entropy or other error functions.
Other, larger and more complex networks (such as abalone, diabetes or gene) never converge using back-prop, and the MSE ranges between 5.0 and 1.5 in the best case.

Back-Prop:
So is the failure to learn down to back-prop, or to me measuring the MSE wrong? Or should I also implement R-Prop in order to figure out what's going on?

I would greatly appreciate any help on the matter.

Ray

Dec 10, 2015, 11:30:05 PM
If you're training on XOR data and getting a very low error rate, then when you test on XOR
data there is NO reason why you shouldn't get exactly the same error rate. You can't learn
the pattern if any of the whole set of inputs is left out, so none can be withheld as separate
"training" vs. "testing" data. You'll be testing on the same data you trained on, and you
ought to be getting exactly the same MSE result.

It sounds like there is some basic math difference (bug!) between what your implementation
is actually doing when training vs. when testing.

Bear


alex...@gmail.com

Dec 11, 2015, 12:00:42 AM
Hi Ray,

Yes, there was a bug; I fixed it, and now I am getting (almost) the same error.
CUDA has "approximate" fast math (hardware math operators) which doesn't always produce the same results.

My problem is that the MSE goes crazy with other networks (not the XOR network) when using back-prop. I've tried various learning rates and momentum values (either a small learning rate with high momentum, or a high learning rate with small momentum),
using either tanh, tanh scaled (to [-1,1]), or soft-sign.

I see small networks (8 input, 2 output, no hidden) having fluctuating MSEs, and I am wondering if I am computing MSE correctly, hence my previous post.

Is that indeed the correct way to use it?
Do I have to square all output node differences, and then divide by the number of samples/patterns * output nodes?

Are there any advantages in implementing average cross entropy?

I am looking into implementing r-prop as my next task, and in the meantime doing some small code optimizations, but I am worried because none of my networks (with the exception of the XOR ones) converge to a reasonable MSE.

Ray

Dec 12, 2015, 3:15:05 PM
alex...@gmail.com wrote:
> Hi Ray,
>
> Yes, there was a bug; I fixed it, and now I am getting (almost) the same error.
> CUDA has "approximate" fast math (hardware math operators) which doesn't always produce the same results.

> My problem is that the MSE goes crazy with other networks (not the XOR network) when using back-prop. I've tried various learning rates and momentum values (either a small learning rate with high momentum, or a high learning rate with small momentum),
> using either tanh, tanh scaled (to [-1,1]), or soft-sign.

Check for one minor bug: if you're using momentum, make sure your bias connections
are NOT subject to it. It never helps there, and it can cause oscillations and
instabilities in some cases. Also be wary of setting your momentum constant too
high: it should never, ever be more than 1 - 2/n, where n is the number of examples
you're training on.

Many systems with momentum automatically set (all) momentum to zero at the end
of any training run where the training error has increased instead of decreasing.
This is *as* mathematically correct as using momentum in the first place and
usually seems to get better results.

When in doubt use batch training instead of momentum.

Batch training means that when you do backpropagation, you don't change the weights.
Instead, you just add the change to a running sum for that weight. Then, after a
"batch" of inputs and backpropagations, add the running sums to the weights
all at once, with a very small learning rate and no momentum. Stochastic
gradient descent without momentum == batch size 1, and it can easily get stuck
on diagonal gradients, ridges, etc.

"one batch = all training data" is the most precise mathematical definition of
the correct behavior. The only reason we don't do everything like that is
because it's extremely slow.
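
In C, the accumulate-then-apply loop is roughly this (backprop_accumulate is a
stand-in for whatever routine adds one example's changes into the running sums
without touching the weights):

/* One batch update: sum the changes over the whole batch, then
   apply them once, with a small learning rate and no momentum. */
void train_batch(float *weights, float *grad_sum, int n_weights,
                 int batch_size, float lr,
                 void (*backprop_accumulate)(const float *w, float *g, int example)) {
    for (int i = 0; i < n_weights; i++)
        grad_sum[i] = 0.0f;
    for (int ex = 0; ex < batch_size; ex++)
        backprop_accumulate(weights, grad_sum, ex);  /* weights untouched here */
    for (int i = 0; i < n_weights; i++)
        weights[i] += lr * grad_sum[i];  /* apply the summed change once */
}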

Okay, here's a smoke test for a serious bug: Try using way too many hidden nodes
(like, one per case). This should result in rapid convergence to zero MSE on training
cases (and normally, absolutely no ability to handle non-training cases, but I digress).
If it doesn't -- if MSE on training cases under those conditions doesn't go to zero --
then something in your code is seriously borked.

Usually with too many hidden nodes, you get a network that can overfit the training
data instead of learning real rules that will generalize to be appropriate for
testing data. That's exactly what the smoke test above exploits. The flipside of
it is that with not enough hidden nodes, your network won't be able to learn any
remotely-complex patterns. So the balancing act that neural network designers
are always trying to do is to get the right number of hidden nodes to be able to
learn general rules, but not so many that it can learn special rules just to take
care of individual training cases. You've got it right when learning the general
rules is just barely within its capacity; under that circumstance it will perform
only a few percent better, if that, on the training data. So the next test is
making sure that actually works.

Second test: Try reducing the number of hidden nodes and training until you get
"acceptable" error rates on the training data, then check your testing data. You'll
want to do this several times (to find averages) at each reduced number of hidden
nodes. What you should observe here is that performance on training data gets
slightly worse as the number of hidden nodes is reduced, but performance on testing
data gets better. When an "average" training results in a network with a score on
testing data that's within a few percent of the score on training data, you've
found the right number of hidden nodes. But if that error rate is unacceptably
high, it means you may need a deeper network to solve this problem.

"Normal" use of a neural network is early stopping. That is, early in training
you'll see the training and testing errors declining at about the same rate, then
at some point the network starts to overfit and you see your error on testing data
starts consistently rising while error on training data continues to fall. At
that point you stop training. This works (reasonably well anyway) even if you
have more than enough hidden nodes.

There is another way to fight overfitting. You can use dropout training, and
deliberately use too many hidden nodes. It seems ridiculous, but works very well.
In dropout training, you randomly pick half the hidden nodes for each example and
force their output to zero. Then double the output of all the other nodes. When
you have trained the network, use all the nodes and normal output rates. To use
dropout training you need more nodes, but overfitting is nearly impossible and the
accuracy is better than with early stopping.
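
A rough sketch of the per-example mask in C (this drops each hidden node
independently with probability 1/2 rather than picking exactly half; close
enough in practice):

#include <stdlib.h>

/* During training, for each example: zero a random half of the hidden
   activations and double the survivors. At test time, skip this and
   use all nodes at normal output. */
void apply_dropout(float *hidden, int n_hidden) {
    for (int i = 0; i < n_hidden; i++) {
        if (rand() % 2)
            hidden[i] = 0.0f;   /* dropped for this example */
        else
            hidden[i] *= 2.0f;  /* compensate for the dropped half */
    }
}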

Ray

Dec 12, 2015, 5:50:05 PM

An important thing I forgot, which would also explain your bugses:

Make sure that the partial derivative with respect to error that you're
using is CORRECT!

Sounds crazy? It's one of the most common bugs. Hard to catch because
it still allows convergence in simple cases like XOR. Lots of people
wind up thinking neural nets are no good, when the problem was just
that they had the wrong derivative function.

Anyway, add the function derivcheck to your program.

/* dunno what language you're using, but this is C syntax.

The first argument is a pointer to a function. The function
it points to must take one floating point argument and return
a floating point value.

and the second argument darg is the value at which we want the
partial derivative.*/

float derivcheck( float (*activ)(float), float darg ){
    float a1 = darg + 0.001f;
    float a2 = darg - 0.001f;
    float rise = activ(a1) - activ(a2);  /* a1 > a2, so a1's value comes first */
    float run = 0.002f;                  /* a1 - a2 */
    return (rise/run);
}


and then somewhere in your setup code just to make sure you've defined
the derivative function right, you can use this:

/* again, C syntax; this assumes you've got an activation
function named activ and what's supposed to be the derivative
of that function is named deriv. This uses derivcheck to
make sure that deriv is "close to" the right value at ~two
thousand points from -10 to +10. */

/* needs #include <assert.h> at the top of the file */
for (float arg = -10.0f; arg <= 10.0f; arg += 0.01f){
    float der = deriv(arg);
    float dcheck = derivcheck(activ, arg);
    assert(der + 0.001f > dcheck && der - 0.001f < dcheck);
}


If there's only one activation function, you can skip messing around
with passing a function pointer and just call it directly from derivcheck.


Bear

alex...@gmail.com

Dec 13, 2015, 11:06:21 PM
Thanks Ray, I will triple check the derivatives (I know I had issues with them before).

What really puzzles me is this: I have spent days (if not weeks) testing this code, and the only real explanation I can give is that there is either some sneaky bug somewhere, or I haven't really understood how neural networks work.

Even worse, it appears to happen randomly:

XOR network using tanh_norm back-prop MSE: 0.00033083
test [1,0] as input; output: -1 (expecting 1)
test [0,1] as input; output: 0.994896 (expecting 1)
test [0,0] as input; output: -0.00251865 (expecting 0)
test [1,1] as input; output: 0.00251865 (expecting 0)
XOR network test MSE: 0.000330795

The [1,0] failure should have thrown the test MSE off by a lot, yet it didn't!
Maybe I could filter the output by taking the absolute value, but it still troubles me.

This is when using tanh and its derivative:

σ(x) = (e^x - e^(-x)) / (e^x + e^(-x))
σ'(x) = 1 / cosh^2(x)

I've also tried a scaled version of tanh (scaled so that σ(±1) ≈ ±1):

σ(x) = 1.7159 * tanh( (2/3) * x )
σ'(x) = (0.6667/1.7159) * (1.7159 - σ(x)) * (1.7159 + σ(x))

I have also tried with the classic sigmoid:

σ(x) = 1 / (1 + e^(-x))
σ'(x) = σ(x) * (1 - σ(x))

Which quite often fails to converge for the XOR problem (100,000 epochs).

I also tried the sigmoid bipolar:

σ(x) = -1 + 2 / (1 + e^(-x))
σ'(x) = 0.5 * (1 + σ(x)) * (1 - σ(x))

Which seems to work a bit better than the classic sigmoid.
For any other dataset that I've tried, the MSE (if I'm using it correctly) simply stays too high.

Bear in mind I've only implemented and used back-prop; I am about to give r-prop a try as well.

At this point I am seriously wondering if I should be using average cross-entropy, which is often used for classification rather than approximation.

Also, maybe I should try soft-maxing the output?

Ray

Dec 14, 2015, 12:25:04 AM
alex...@gmail.com wrote:
> This is when using tanh and its derivative:
> I've also tried a scaled version of tanh:
> I also tried with the classic sigmoid
> I also tried the sigmoid bipolar:


Sorry, I shouldn't have gone off on sigmoid functions in my first reply to you;
ANY of these is just fine for a network of one hidden layer. If you've got
one, and the right derivative for it, then that isn't the problem.

I've been thinking hard about sigmoids lately because I'm trying to build deep
networks, and tanh and the classic logistic sigmoid don't work very well in
*THAT* case. But for the basic XOR smoke test in a network with one hidden
layer? They're FINE!

Your problem is how you're calculating mean squared error. When you have P
and wanted S, your squared error for that case is (S - P)^2, and it looks
like your code is calculating S^2 - P^2 or P^2 - S^2.

Bear
PS. My news reader is confused about whether or not the other reply
I just sent got through; so if this is redundant, please bear with
me.




alex...@gmail.com

Dec 14, 2015, 12:13:32 PM
Hi Ray,

No need to apologise; I was hoping to examine different activation functions, although I must admit I don't completely understand how they affect the network.

I think I've solved my bug: it was an issue with reading the output from the GPU too early; the host (CPU) has to wait for the kernel to finish.
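
For anyone hitting the same thing, the fix amounts to synchronizing before reading results back on the host. A minimal sketch (my illustration, not the actual cuANN code; the kernel name is made up):

#include <cuda_runtime.h>

/* After launching the forward-pass kernel, e.g.
     forward_kernel<<<blocks, threads>>>(d_in, d_out, n);
   wait for it to finish before the host reads the outputs. */
void read_outputs(float *h_out, const float *d_out, int n) {
    cudaDeviceSynchronize();  /* block until pending GPU work completes */
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
}

(cudaMemcpy on the default stream already orders itself after prior kernels, but the explicit sync makes the intent obvious; copying a Thrust device_vector back to the host synchronizes similarly.)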

So, XOR networks (tanh, tanh scaled) seem to work.
An XOR with sigmoid fails to converge (I've tried up to 500,000 epochs with various learning and momentum rates). Sigmoid bipolar XOR works sometimes.

My problem starts with me not fully understanding MSE's properties.
Is it supposed to be a percentage error, e.g., does 0.2 MSE = 20% error?
Jeff Heaton suggests so in his YouTube videos.

Or is it dependent on sample size? (That is my understanding.)

I believe I'm calculating it correctly, see for yourself: https://github.com/alexge233/cuANN/blob/dev/src/kernel/kernel.cu#L137
I then proceed to divide it by the number of output nodes * sample size:
https://github.com/alexge233/cuANN/blob/dev/src/ann/ann.hxx#L174

Finally, is it so easy for a network to fail to converge when using back-prop?

Ray

Dec 14, 2015, 10:15:05 PM
alex...@gmail.com wrote:
>
> So, XOR networks (tanh, tanh scaled) seem to work.
> An XOR with sigmoid fails to converge (I've tried up to 500,000 epochs with various learning and momentum rates). Sigmoid bipolar XOR works sometimes.

> My problem starts with me not fully understanding MSE's properties.
> Is it supposed to be a percentage error, e.g., does 0.2 MSE = 20% error?
> Jeff Heaton suggests so in his YouTube videos.

> Or is it dependent on sample size? (That is my understanding.)

Your mean squared error is the sum of the squared error for every case,
divided by the number of cases.

So, if you have squared errors of 4, 3, 2, and 1, your mean squared error
would be 2.5. The *actual* errors in that case might be something like
-2, sqrt(3), -sqrt(2), -1. Squaring means ignoring the sign of the
actual errors.

> I believe I'm calculating it correctly...

If you ever have a case where the error is 2 (the difference between +1 and -1)
and the squared-error for that case isn't 4, you have a math problem. If there
are 4 cases, and one of them has squared-error of 4, the mean squared error
cannot under any circumstances be less than 1.

> Finally, is it so easy for a network to fail to converge when using back-prop?

On a one-hidden-layer network, convergence ought to be pretty automatic. That said:

The theory of gradient-descent assumes continuous adjustments so you can smoothly
follow the gradient to the highest fitness point. The practice, however, involves
learning rates where the weights are adjusted by some definite amount at discrete
events, and these sometimes invalidate the theory's assumptions - If your learning
rate is too large, your adjustment may go right on past the solution, and hit a
steeper gradient that causes the next adjustment to take it further away, etc.

It's normal to start with a higher learning rate to get it "close to" the solution,
then reduce it during training to reduce this kind of bouncing-past behavior and
let it settle closer to a solution.
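
A step schedule is the usual way to do that; a tiny sketch in C (the halving
factor and the step size are arbitrary choices of mine):

#include <math.h>

/* Step decay: halve the learning rate every `step` epochs,
   starting from lr0. */
float learning_rate_for_epoch(float lr0, int epoch, int step) {
    return lr0 * powf(0.5f, (float)(epoch / step));
}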

The theory also assumes you are measuring the MSE for the whole data set at every
point along the way, and in practice most people train on batches of data much
smaller, or use momentum to approximate batch training, and that can cause its own
set of (usually smaller) problems, some of which are made less severe by choosing
cases to present in random order. In practice with random training case order and
momentum, these are not usually problems for networks of 2 hidden layers or less.

For a one-hidden-layer network? If you have a problem compatible with a one-layer
solution, a few more hidden-layer weights than the minimal solution needs, and
anything even remotely reasonable for a learning rate, it shouldn't be possible
to get it to NOT converge.

When you start working on deeper networks, "small enough" learning rates become
a real problem because you get exploding gradients where a very small adjustment
of a weight connecting input to first hidden layer will cause enormously steep
gradients to emerge at layers closer to output, and those drive large weight
adjustments which send weights "out of range" to outlying places where there is
almost *NO* gradient. That effectively undoes and stops training for those
weights. Deeper networks than 2 hidden layers were considered impossible for a
long time due to the exploding gradient problem - it was effectively impossible
to train them.

That's why in deeper networks you learn to heavily favor non-exponential logistic
functions such as softsign, and use ad hoc methods such as adjustment limits, etc.
Lots of people training deeper networks give up on logistics entirely and use
non-sigmoid activation functions, or resort to other methods instead to train
the weights.

Bear


alex...@gmail.com

Dec 15, 2015, 12:49:44 AM
Hi Ray,

> Your mean squared error is the sum of squared-error for every case,
> divided by the number of cases.
>
> So, if you have sqared-error of 4,3,2,and 1, your mean squared error
> would be 2.5, The *actual* error in that case might be something like
> -2, sqrt(3), -sqrt(2), -1. Squaring means ignoring the sign of the
> actual errors.

That is what I am doing.
I hadn't understood (silly me) that squaring them would ignore the sign.
However, that is an issue, as sometimes my output is negative.
I guess I could filter my output using the absolute value?

> > Finally, is it so easy for a network to fail to converge when using back-prop?
>
> On a one-hidden-layer network, convergence ought to be pretty automatic. That said:
>

So what other reason (besides the MSE) could there be that I can't train anything other than an XOR network? Insufficient hidden neurons?
The same code that works for XOR (tanh, tanh scaled, sigmoid) fails for all other networks.

> The theory of gradient-descent assumes continuous adjustments so you can smoothly
> follow the gradient to the highest fitness point. The practice, however, involves
> learning rates where the weights are adjusted by some definite amount at discrete
> events, and these sometimes invalidate the theory's assumptions - If your learning
> rate is too large, your adjustment may go right on past the solution, and hit a
> steeper gradient that causes the next adjustment to take it further away, etc.
>
> It's normal to start with a higher learning rate to get it "close to" the solution,
> then reduce it during training to reduce this kind of bouncing-past behavior and
> let it settle closer to a solution.
>

OK, I have not been doing that; instead, my learning rate and momentum are fixed throughout all epochs.

So I should gradually adjust the learning rate.
Should that adjustment be linear, or something else?

> The theory also assumes you are measuring the MSE for the whole data set at every
> point along the way, and in practice most people train on batches of data much
> smaller, or use momentum to approximate batch training, and that can cause its own
> set of (usually smaller) problems, some of which are made less severe by choosing
> cases to present in random order. In practice with random training case order and
> momentum, these are not usually problems for networks of 2 hidden layers or less.
>

I do batch training. I measure the MSE over each epoch (a full iteration of the training samples), and then reset it.

At every iteration I randomly shuffle the training samples so that they are not presented to the network in the same order.

Typical learning rate and momentum I've used:

7,1
7,2
7,3
2,7
2,9

> For a one-hidden layer network? If you have a problem compatible with a one-layer
> solution, and a few more than the number of hidden-layer weights needed to
> express the minimal solution, and anything even remotely reasonable for a learning
> rate - it shouldn't be possible to get it to NOT converge
>

The only things that seem to converge are XOR networks, nothing else.
I have been limiting them to 500,000 epochs, but I notice that the
MSE gets stuck at the same value for all epochs.

> When you start working on deeper networks, "small enough" learning rates become
> a real problem because you get exploding gradients where a very small adjustment
> of a weight connecting input to first hidden layer will cause enormously steep
> gradients to emerge at layers closer to output, and those drive large weight
> adjustments which send weights "out of range" to outlying places where there is
> almost *NO* gradient. That effectively undoes and stops training for those
> weights. Deeper networks than 2 hidden layers were considered impossible for a
> long time due to the exploding gradient problem - it was effectively impossible
> to train them.
>
> That's why in deeper networks you learn to heavily favor non-exponential logistic
> functions such as softsign, and use adhoc methods such as adjustment limits, etc.
> Lots of people training deeper networks give up on logistics entirely and use
> non-sigmoid activation functions, or resort to other methods instead to train
> the weights.
>
> Bear

I will keep that in mind once I get to that stage. For now I am still struggling with the elementary stuff.

BTW, you haven't given me your opinion on average cross-entropy or soft-max output activation?

Regards,
Alex

bear

Dec 15, 2015, 9:00:07 PM
alex...@gmail.com wrote:
: Hi Ray,

: I hadn't understood (silly me) that squaring them would ignore the sign.
: However, that is an issue as some times my output is negative.
: I guess I could filter my output using the absolute value?


Yes. The adjustment is e * abs(e).

: OK, I have not been doing that, instead my learning rate and momentum is fixed throughout all epochs.

: So, I would gradually adjust the learning rate.
: Should that adjustment be linear or ???

Doesn't really matter. Most people do it in steps.

: I do batch training. I measure MSE for each epoch (a full iteration of the training samples), and then reset it.

: At every iteration I randomly shuffle the training samples so that they are not presented to the network in the same order.

Random shuffle isn't needed, at all, if your batch for batch training is the whole
data set. Harmless though.

: Typical learning rate and momentum I've used:

: 7,1
: 7,2
: 7,3
: 2,7
: 2,9


Say WHAT?!

A learning rate of 2 to 7 means adjusting by two to seven TIMES the error.
It would be astonishing if ANYTHING converged if you do that. It'll make
the adjustments overshoot your target and land further away on the other
side than where you started! 0.1 is considered a large learning rate; reduce
it to 0.01 toward the end to refine the answer.

If you're batch training with the whole data set, your momentum should be 0.
There is absolutely no need for momentum in that case, because there are no
"diagonals" with needed examples missing from the current backpropped data
set. Momentum is there to correct problems that happen with very small
batches (like batches of one, for example) where different weights all need
to move simultaneously to make progress (a "diagonal") and no individual
example moves all of those weights. Batch training does the same thing better.

And what does a momentum constant >1 even MEAN? The momentum constant is what
fraction of the momentum is conserved from one training to another. Each
training, the momentum is multiplied by the momentum constant and the current
error is added to it. With a momentum constant of 1, you're training based
on ALL past and present error, every time. With a momentum constant greater
than 1, you'd be exponentially increasing the momentum by multiplying it
by that number every round.

Even if you were training after every example, there would NEVER be a need
for a momentum constant greater than 0.5 in training with only 4 examples
with examples randomly drawn.

Different people do this differently, but as I run things in my setups, the
learning rate is reduced by the fraction of momentum carried forward. So if
my learning rate were 0.1 and I was using 0.6 momentum, I'd be adjusting by
0.04 (learning rate x (1 - momentum)) times the (momentum + current error).
If my learning rate were 0.1 and I was using 0.9 momentum, I'd be adjusting
by 0.01 times the (momentum + current error). The reason why I do that is
to make the training per batch proportional to the learning rate, regardless
of where momentum is set, so the learning rate means the same thing in systems
with different momentum.
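
In code, that works out to something like this per weight (a C sketch of the
rule I just described, so a sketch rather than production code):

/* Momentum update where the step is scaled by (1 - momentum), so the
   learning rate means the same thing at any momentum setting.
   velocity persists across updates for each weight. */
void update_weight(float *w, float *velocity,
                   float error_gradient, float lr, float momentum) {
    *velocity = momentum * (*velocity) + error_gradient;
    *w += lr * (1.0f - momentum) * (*velocity);
}

With lr = 0.1 and momentum = 0.6, this adjusts by 0.04 times the
(momentum + current error), matching the numbers above.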


: BTW, you haven't told me about your opinion on average cross entropy or soft-max output activation?

Average cross entropy isn't feasible for the size network I'm working on, so
I haven't looked at it very much. Soft-max output activation is just a way
to turn output activation levels into estimates of percentage chances or
estimates of percentage proportions, depending on the problem. Neither is a
crucial concept IMO.

Bear

alex...@gmail.com

Dec 16, 2015, 10:19:07 AM
Hi Ray,

I meant to write:

.7, .1 (i.e., 0.7, 0.1)

and so on, but somehow (probably due to tiredness) wrote them without the decimal points.

However, what you're describing is starting to make sense (a lot of sense) to me.
I've been training with huge momentum and learning rates.
The reason is that many (and I mean many) resources on the internet state that I can/should/could use learning rates up to 2.0(!) and momentum rates up to 1.0(!).

Would the huge momentum in effect explain why I see the MSE vary wildly up and down?

And could the large learning rate explain why all my networks fail to converge?

Even worse, could a combination of large learning rates and momentum rates explain why I've failed miserably to train anything other than XOR?

I'm going to run all my data-set experiments using learning rate 0.1 and momentum 0.01 and see what happens.

I will also implement your formula for diminishing the learning rate (is this yours or a reference from a paper?).

By the way I can't begin to describe how thankful I am for your help!

Ray

Dec 16, 2015, 10:40:06 PM
In article <ed648c9e-2b55-4d7a...@googlegroups.com> you wrote:
> Hi Ray,
>
> I meant to write:
>
> .7, .1 (e.g., 0.7,0.1)
>
> and so on, but somehow (probably due to tiredness) wrote it without the decimal.

Okay. A learning rate that large is a mistake, even for a network that's only
set up for the XOR problem, but at least it's not totally insane like >1.


> I've been training with huge momentum and learning rates.
> The reason is that many (and I mean many) resources on the internet state that I can/should/could use learning rates up to 2.0 and momentum rates up to 1.0.

A momentum of 1.0 is ALWAYS insane. Those people are wrong. If you were doing
backprop on single examples and then training after every example, in a problem
with literally thousands of examples, then a momentum of 0.995 or so could be
used, but dynamically speaking there is a WORLD of difference between a tiny
bit less than one and exactly one. One of them is a running average, and the
other is a running sum. And with batches == whole data set, there is no need for
momentum ever because your error will always point in the correct direction.

A learning rate of 2.0 is unstable at any point in the fitness landscape having
a slope greater than 1/2. But the sigmoids you're using have a slope of 1 at
the zero point, so even a trivial 1-layer network with 1 weight isn't stable
unless the solution requires abs(weight) to lie beyond wherever the slope drops
to 1/2 or less. If 2 weights have to be set for a particular solution, the
fitness landscape has a contour with a slope greater than sqrt(2). (Single
layer: the contour with the greatest slope has slope equal to the square root
of the number of weights.)

If you go to more than one layer, it's a multiplier effect instead. The greatest
slope in the fitness landscape will be the greatest slope available on the bottom
layer times the greatest slope available on the top layer.

We never actually use learning rates low enough to be stable at *EVERY* point in our
fitness landscape. To some extent, we rely on our solutions lying well away from the
"pathological" areas of the fitness landscape. But if your solution is close to one
of the areas where your learning rate is unstable, you will not get convergence on
that solution; every time your solution vector wanders into the unstable area when
getting close to the solution, it'll get pitched away from the area by the steep
slope, instead of going closer to the solution. So you do have to use a learning
rate small enough to be stable on all of the areas of the fitness landscape that
are even close to where a solution might be found.

This is the "exploding gradient" problem - why we never used to be able to train
deep networks until we figured out stacked autoencoders followed by refinement
using softsign etc. When your network is even moderately deep, any learning rate
you could reach a solution with in under, say, a year using backprop, is unstable
on too large a fraction of the fitness landscape to be able to find a solution.
And almost every possible solution point, even where your learning rate is stable,
is too close to one of these pathologically steep contours to ever converge.

> The huge momentum would in effect explain why I see the MSE vary wildly up and down?

Either the high learning rate or the high momentum could cause overshooting the
solution. And yes, the MSE varying wildly up and down is a typical symptom.

> And the large learning rate could explain why all my networks fail to converge?

Yes it would. Instability over a large area of the fitness landscape means
non-convergence on any solution within or even close to those areas.

> I'm going to run all my data-set experiments using learning rate 0.1 and momentum 0.01 and see what happens.

I predict you'll either have success, or else find a completely *different* bug. :-)

> I will also implement your formula for diminishing the learning rate (is this yours or a reference from a paper?).

That one's mine. I just got tired of having to multiply the reciprocal of 1
minus the momentum by the learning rate to find out the greatest possible
weight adjustment, so I "fixed" it. With that rule in place, the greatest
possible weight adjustment == the learning rate, so calculating or setting
it is trivial.

Bear

