alex...@gmail.com wrote:
> Hi Ray,
>
> Yes there was a bug, I fixed it, and now I am getting (almost) the same error.
> CUDA has "approximate" fast math (hardware math operators) which does not always produce the same result.
> My problem is that MSE is going crazy with other networks (not the XOR network) when using back-prop. I've tried various learning rates and momentum values (either a small learning rate and high momentum, or a high learning rate and small momentum),
> using either tanh, tanh scaled (to [-1,1]), or soft-sign.
Check for one minor bug: If you're using momentum, make sure your bias connections
are NOT subject to it. It never helps there, and can cause oscillations and
instabilities in some cases. Also be wary of setting your momentum constant too
high. It should never ever be more than 1 - 2/n (one minus 2/n), where n is the number of examples
you're training on.
Many systems with momentum automatically set (all) momentum terms to zero at the end
of any training pass (epoch) where the training error has increased instead of decreased.
This is *as* mathematically correct as using momentum in the first place and
usually seems to get better results.
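In update-rule terms that combination looks roughly like this (a numpy sketch, not your code; w, b, grad_w, grad_b, vel_w and so on are just placeholder names):

import numpy as np

def momentum_step(w, b, grad_w, grad_b, vel_w, lr=0.01, mom=0.9):
    # Velocity accumulates only the WEIGHT gradients; bias connections get
    # a plain gradient step with no momentum.
    vel_w = mom * vel_w - lr * grad_w
    w = w + vel_w
    b = b - lr * grad_b
    return w, b, vel_w

def end_of_epoch(vel_w, epoch_err, prev_err):
    # If this epoch's training error went up, throw the momentum away.
    if prev_err is not None and epoch_err > prev_err:
        vel_w[:] = 0.0
    return vel_w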
When in doubt use batch training instead of momentum.
Batch training means that when you do backpropagation, you don't change the weights
right away. Instead, just add the change to a running sum for that weight. Then after a
"batch" of inputs and backpropagations, add the running sums to the weights
all at once, with a very small learning rate and no momentum. Stochastic
gradient descent without momentum == batch size 1, and it can easily get stuck
on diagonal gradients, ridges, etc.
"one batch = all training data" is the most precise mathematical definition of
the correct behavior. The only reason we don't do everything like that is
because it's extremely slow.
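In code the batch version is just an accumulator (a sketch; backprop_gradients stands in for your own backprop routine, not a real library call, and is assumed to return one gradient array per weight array):

import numpy as np

def train_batch(weights, examples, backprop_gradients, lr=0.001):
    # Accumulate the changes over the whole batch without touching the
    # weights themselves.
    totals = [np.zeros_like(w) for w in weights]
    for x, target in examples:
        grads = backprop_gradients(weights, x, target)   # one backprop pass
        for acc, g in zip(totals, grads):
            acc += g
    # Apply the summed changes all at once: small learning rate, no momentum.
    return [w - lr * acc for w, acc in zip(weights, totals)]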
Okay, here's a smoke test for a serious bug: Try using way too many hidden nodes
(like, one per case). This should result in rapid convergence to zero MSE on training
cases (and normally, absolutely no ability to handle non-training cases, but I digress).
If it doesn't -- if MSE on training cases under those conditions doesn't go to zero --
then something in your code is seriously borked.
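Here's one concrete way to run that test (pure numpy; the toy data, sizes, and learning rate are invented just for illustration):

import numpy as np

# One hidden node per training case; training MSE should collapse to ~zero.
rng = np.random.default_rng(0)
n, d = 8, 3                                # 8 training cases, 3 inputs
X = rng.normal(size=(n, d))
T = rng.normal(size=(n, 1))                # arbitrary targets

W1 = rng.normal(scale=0.5, size=(d, n)); b1 = np.zeros(n)   # n hidden nodes
W2 = rng.normal(scale=0.5, size=(n, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(20000):
    H = np.tanh(X @ W1 + b1)               # forward pass
    Y = H @ W2 + b2
    err = Y - T
    mse = np.mean(err ** 2)

    dY = 2.0 * err / n                     # backward pass, full batch
    dW2 = H.T @ dY
    db2 = dY.sum(axis=0)
    dZ = (dY @ W2.T) * (1.0 - H ** 2)      # tanh derivative
    dW1 = X.T @ dZ
    db1 = dZ.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final training MSE:", mse)          # should end up very close to zero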
Usually with too many hidden nodes, you get a network that can overfit the training
data instead of learning real rules that will generalize to be appropriate for
testing data. That's exactly what the smoke test above exploits. The flipside of
it is that with not enough hidden nodes, your network won't be able to learn any
remotely-complex patterns. So the balancing act that neural network designers
are always trying to do is to get the right number of hidden nodes to be able to
learn general rules, but not so many that it can learn special rules just to take
care of individual training cases. You've got it right when learning the general
rules is just barely within the network's capacity; under that circumstance it will
perform only a few percent better on the training data than on the testing data, if
that. So the next test is making sure that actually works.
Second test: Try reducing the number of hidden nodes and training until you get
"acceptable" error rates on the training data, then check your testing data. You'll
want to do this several times (to find averages) at each reduced number of hidden
nodes. What you should observe here is that performance on training data gets
slightly worse as the number of hidden nodes is reduced, but performance on testing
data gets better. When an "average" training results in a network with a score on
testing data that's within a few percent of the score on training data, you've
found the right number of hidden nodes. But if that error rate is unacceptably
high, it means you may need a deeper network to solve this problem.
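If you want a template for that sweep, something along these lines works (scikit-learn and a synthetic dataset used purely for illustration; substitute your own network and data):

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_friedman1(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Shrink the hidden layer step by step; several restarts per size to average
# out lucky/unlucky initializations.
for n_hidden in (64, 32, 16, 8, 4, 2):
    tr_mse, te_mse = [], []
    for seed in range(5):
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                           random_state=seed)
        net.fit(X_tr, y_tr)
        tr_mse.append(mean_squared_error(y_tr, net.predict(X_tr)))
        te_mse.append(mean_squared_error(y_te, net.predict(X_te)))
    print(n_hidden, "train MSE", np.mean(tr_mse), "test MSE", np.mean(te_mse))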
"Normal" use of a neural network is early stopping. That is, early in training
you'll see the training and testing errors declining at about the same rate, then
at some point the network starts to overfit and you see your error on testing data
starts consistently rising while error on training data continues to fall. At
that point you stop training. This works (reasonably well anyway) even if you
have more than enough hidden nodes.
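The control logic is simple (a sketch; train_one_epoch and testing_error are placeholders for whatever your own code provides, not library functions):

def train_with_early_stopping(train_one_epoch, testing_error,
                              max_epochs=1000, patience=10):
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = testing_error()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
            # ideally also snapshot the weights here
        else:
            since_best += 1
            if since_best >= patience:    # testing error keeps rising: stop
                break
    return best_epoch, best_err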
There is another way to fight overfitting. You can use dropout training, and
deliberately use too many hidden nodes. It seems ridiculous, but works very well.
In dropout training, you randomly pick half the hidden nodes for each example and
force their output to zero. Then double the output of all the other hidden nodes.
When you have finished training, use all the nodes with no doubling. To use
dropout training you need more nodes, but overfitting is nearly impossible and the
accuracy is better than with early stopping.
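Mechanically it's just a random mask in the forward pass during training (a sketch, assuming a tanh hidden layer and a fresh mask for each example):

import numpy as np

def hidden_layer(x, W, b, rng, training):
    h = np.tanh(x @ W + b)
    if training:
        # Zero a random half of the hidden nodes and double the survivors,
        # so the expected signal reaching the next layer is unchanged.
        dropped = rng.random(h.shape) < 0.5
        h = np.where(dropped, 0.0, 2.0 * h)
    # At test time: all nodes active, outputs not doubled.
    return h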