Dealing with 2 local minima

TomH488

Apr 18, 2013, 1:15:43 PM

I have a 3 layer MLP that predicts rather well but seems to have noise
in the output.

Upon closer inspection, there appear to be "two trajectories" that
predictions fall on - as they alternate back and forth, it looks like
noise.

I believe I am seeing 2 local minima appear in my predictions.

Is there any particular way to deal with a problem of TWO local
minima?

Thanks in advance,
Tom

Greg Heath

Jun 3, 2013, 9:33:30 AM

A few nonrandom thoughts:

Renormalization of variables

Smaller step size

A different optimization algorithm

Hope this helps.

Greg

Tomasso

Jun 3, 2013, 9:57:19 PM

"TomH488" <tom...@gmail.com> wrote in message news:8aebbf84-5a72-4d42...@m4g2000yqi.googlegroups.com...

If there are two local minima, you would converge on one or the other depending on initial weights.

You are simply not converging.

Possibly, if you are using online learning (i.e., updating weights after each pattern), something in the order of your
data is throwing you from one convergence manifold to another. If so, try batch learning, or stochastic (randomly ordered) presentation.

More likely your learning rate is too high. Reduce it, and experiment with relaxation.
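
To illustrate the learning-rate point on a toy case (this is not the original poster's network), here is a minimal Python sketch of plain gradient descent on the one-dimensional loss f(w) = w^2. The function name `descend` and the step sizes are assumptions chosen for illustration. With a step size of 1.0 the weight flips between +1 and -1 forever, which can look like two trajectories even though there is only one minimum; a smaller step settles down.

```python
def descend(lr, w0=1.0, steps=20):
    """Plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * 2.0 * w  # equivalent to w <- (1 - 2*lr) * w
        path.append(w)
    return path

too_high = descend(lr=1.0)  # factor (1 - 2*lr) = -1: the sign flips every step
smaller = descend(lr=0.1)   # factor 0.8: shrinks steadily toward the minimum at 0

print(too_high[:4])
print(abs(smaller[-1]))
```

A real network has many weights and a much rougher error surface, so this only shows the mechanism, not the actual cause in any given model.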

marcus....@gmail.com

Jun 18, 2013, 4:34:23 PM

Relaxation is key

Greg Heath

Jul 2, 2013, 12:24:22 AM

On Tuesday, June 18, 2013 4:34:23 PM UTC-4, Marcus Appelros wrote:
> Relaxation is key

What do you mean by relaxation? Weight decay?

Greg

Marcus Appelros

Jul 18, 2013, 7:28:30 AM

> More likely your learning rate is too high. Reduce it, and experiment with relaxation.

Relaxing in general is very good; it brings clearer thoughts. Do some stretches and go for a bit of fresh air. Also check the network for spots of high information flow and see if you can bring some relief to these nodes.

TomH488

Aug 16, 2013, 1:11:43 PM

On Thursday, April 18, 2013 1:15:43 PM UTC-4, TomH488 wrote:

Sorry for the belated reply:

The neural network platform I'm using is NeuroShell 2, which is fairly basic. They have a black-box training method called TurboProp I that simply does not work a fair number of times.

So that leaves Momentum - my only other option.

After a lot of empirical experimentation, and needing a method that does not require continuous user tweaking, I settled upon a momentum of 0.9 (0 to 1 allowed) and the largest learning rate that does not blow up initially, 0.01 being typical.
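
For anyone unfamiliar with the setting, here is a minimal Python sketch of the textbook momentum update, using a momentum of 0.9 and a learning rate of 0.01. It is an assumption that NeuroShell 2 uses this exact form; `momentum_step` and the toy loss f(w) = w^2 are made up for illustration.

```python
def momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """One classical momentum update: the velocity is a decaying sum of past steps."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = 1.0, 0.0
for _ in range(500):
    grad = 2.0 * w  # gradient of the toy loss f(w) = w**2
    w, v = momentum_step(w, v, grad)

print(abs(w))  # small: the iterate has settled near the minimum at 0
```

Because 90% of the previous step carries over, each update is dominated by recent history rather than by the current case alone.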

I also used random rather than sequential case training, so Momentum acts as a "random perturber" as each case is trained upon.

Initial weights, typically 0.3 (+-0.3) from a uniform distribution, were used (uniform is the only option).

Then I would use 6 trainings with different random number seeds to "smooth out the results."
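
A minimal Python sketch of that smoothing step, assuming "smooth out" means a plain point-by-point average of the predictions from each run. The function name `average_runs` and the numbers are hypothetical.

```python
def average_runs(runs):
    """Average predictions point by point across several training runs."""
    n_runs = len(runs)
    return [sum(values) / n_runs for values in zip(*runs)]

# Hypothetical predictions for three cases from three differently seeded runs.
runs = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
smoothed = average_runs(runs)
print(smoothed)
```

The average can never be more extreme than the most extreme individual run at any point, which is where the smoothing comes from.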

I figure keeping the momentum really high is like a Poor Man's Simulated Annealing.

I also found that with about 50% more hidden nodes (100) than inputs (70), I get my best results after only 10 epochs of training. Maybe 20-30 epochs max. Any more than that, and chaos really shows up.

I wish I could dither (add fuzz to) each training case each time it is processed, since I believe this is an extremely robust method to smooth the solution. The fact that the net never sees the same case twice really should help with generalization and not memorization. Further, the "fuzz" sort of sets a slope constraint between case points that should really reduce interpolation haywir-ed-ness (make it smoother!). But unless I write my own code, that isn't going to happen. (Frankly, I think I would be a lot better off using all the code Timothy Masters has published along with his books, which I think are grossly under-utilized. What can be better than having the source code? And what a great work to have an active forum on. [Maybe I should try to set one up on Google Groups.])
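
If one did write the code, the dithering step itself is small. A minimal Python sketch, assuming Gaussian fuzz with a hand-picked standard deviation; `jitter`, the value of `sigma`, and the sample case are all hypothetical.

```python
import random

def jitter(case, sigma, rng):
    """Return a copy of one training case with Gaussian noise added to every input."""
    return [x + rng.gauss(0.0, sigma) for x in case]

rng = random.Random(0)    # fixed seed only so the sketch is repeatable
case = [0.5, -1.2, 3.0]   # hypothetical input vector

first = jitter(case, 0.05, rng)   # what the net would see on one presentation
second = jitter(case, 0.05, rng)  # a slightly different version on the next

print(first != second)
```

The original case is left untouched, so fresh noise can be drawn every epoch. How large `sigma` should be relative to the input scaling is a tuning question this sketch does not answer.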

I'm also way behind the curve on picking lags. I have been looking at cross-correlations of input to output but really need to get an understanding of the PACF (partial autocorrelation function) and its usage. Of course these are all based on linear problems, but still, I believe they are relevant. The Big Thing is that if you have an input that affects the output 30 days into the future, you had better have a 30-day lag. But as soon as you say that, what if you were using differencing to deal with nonstationarity? Of what use is a difference that is far distant from the output due to a big lag? It's like you need a 30-day difference or just the raw input.
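
A minimal Python sketch of the cross-correlation scan and of a k-step difference, on synthetic data where the output is simply the input delayed by 3 steps. The helper names (`pearson`, `lagged_corr`, `k_step_difference`) and the data are assumptions for illustration, and like any correlation this only picks up linear relationships.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def lagged_corr(x, y, lag):
    """Correlate x[t] with y[t + lag]: the input against the output lag steps later."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

def k_step_difference(x, k):
    """Difference each value against the one k steps earlier: x[t] - x[t - k]."""
    return [x[t] - x[t - k] for t in range(k, len(x))]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic input series
y = [0.0] * 3 + x[:-3]                          # synthetic output: x delayed by 3 steps

# Scan lags 0..10 and keep the one with the strongest correlation.
best_lag = max(range(11), key=lambda lag: lagged_corr(x, y, lag))
print(best_lag)
```

On real market data the peak will be far less clean than in this synthetic case, so the scan is a starting point for choosing lags rather than an answer.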

Lots of problems and issues. I could keep 10 of me working without end. But my time is a limiting factor.

I am very close, actually. If I could simply get smoother results (stock market prediction) between trainings, I would be looking at yachts. I even looked into Kalman filters, which are quite amazing and way above my pay grade. And their combination with neural networks is quite amazing. It seems like the weight matrix of a net becomes one of the Kalman matrices! Wow...!

So I'm stuck with my platform and just need to find one more improvement to get the drawdowns under control.

And I'm not even talking about the trading model! HaHaHaaa

Thanks so much, as always, in advance,
Tom