
MLP Mean & Var of Predictions are 0.4 times the Training Values


TomH488

Sep 22, 2015, 1:25:03 PM
Both Mean and Standard Deviation are .40 of the training values.

The Output is binary (-1 or +1).

Symmetric Logistic used on the Hidden layer.
Tanh used on the Output layer.

Early stopping with 20% Test Set.
______________________


Just thinking now perhaps I should try Linear on the Output Layer?

I have no other ideas.

TomH488

Sep 22, 2015, 1:26:59 PM
I could see a scale of the StDev, but a shift in the Mean is what really perplexes me.

Greg Heath (alumni.brown.edu)

Sep 23, 2015, 5:20:14 AM
On Tuesday, September 22, 2015 at 1:25:03 PM UTC-4, TomH488 wrote:
> Both Mean and Standard Deviation are .40 of the training values.
>
> The Output is binary (-1 or +1).
>
> Symmetric Logistic used on the Hidden layer.
> Tanh used on the Output layer.

What is the difference between a symmetric logistic and tanh ?

> Early stopping with 20% Test Set.

The validation subset is used for Early Stopping.

The test subset is used to get an unbiased estimate of performance on nontraining (validation, test and unseen) data.

> Just thinking now perhaps I should try Linear on the Output Layer?

I don't think that is your problem.

> I have no other ideas.

[ I N ] = size(input) = ?
[ O N ] = size(target) = ?
H = number of hidden nodes = ?
(Ntrn/Nval/Ntst)/N = /?/0.2/?
Ntrneq = Ntrn*O = number of training equations?
Nw = (I+1)*H +(H+1)*O = number of unknown weights ?
computer language = ?
How many random initial weight designs ?
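For reference, a minimal sketch of how the last few quantities fit together (MATLAB-style; every value below is only a placeholder):

  input  = rand(70, 1000);            % I-by-N input matrix (placeholder data)
  target = sign(randn(1, 1000));      % O-by-N target matrix of +/-1 (placeholder data)
  [ I, N ] = size(input);
  [ O, N ] = size(target);
  H      = 40;                        % number of hidden nodes (placeholder)
  Ntrn   = round(0.70*N);             % training cases left after the val/test split (placeholder ratio)
  Ntrneq = Ntrn*O                     % number of training equations
  Nw     = (I+1)*H + (H+1)*O          % number of unknown weights

The idea is to see how Ntrneq compares with Nw.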

Greg



Greg Heath (alumni.brown.edu)

Sep 23, 2015, 5:27:54 AM
On Tuesday, September 22, 2015 at 1:25:03 PM UTC-4, TomH488 wrote:
What is your normalized mean-square-error?

NMSE = mse(target-output)/ mean(variance(target))
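
A minimal sketch of that calculation (MATLAB-style; the arrays below are only placeholders):

  t = sign(randn(1, 1000));           % O-by-N targets (+/-1), placeholder data
  y = tanh(randn(1, 1000));           % O-by-N network outputs, placeholder data
  MSE  = mean((t(:) - y(:)).^2);      % mean-square-error over all outputs
  NMSE = MSE / mean(var(t, 1, 2))     % normalized by the mean target variance

NMSE near 0 means nearly all of the target variance is explained; NMSE near 1 means the net does no better than a constant output equal to the target mean.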

Greg

Stephen Wolstenholme

Sep 23, 2015, 9:01:16 AM
On Tue, 22 Sep 2015 10:24:58 -0700 (PDT), TomH488 <tom...@gmail.com>
wrote:

>Both Mean and Standard Deviation are .40 of the training values.
>
>The Output is binary (-1 or +1).
>
>Symmetric Logistic used on the Hidden layer.
>Tanh used on the Output layer.

I find Logistic works in any layer and it is just a bit faster than
tanh.

>
>Early stopping with 20% Test Set.

Use a validation set.

>______________________
>
>
>Just thinking now perhaps I should try Linear on the Output Layer?
>
>I have no other ideas.
>

Logistic works in any layer if it is limited to the near linear part
of the curve.

Steve

--
Neural Network Software for Windows http://www.npsnn.com

EasyNN-plus More than just a neural network http://www.easynn.com


TomH488

Sep 23, 2015, 8:47:35 PM
On Wednesday, September 23, 2015 at 5:20:14 AM UTC-4, Greg Heath (alumni.brown.edu) wrote:
> On Tuesday, September 22, 2015 at 1:25:03 PM UTC-4, TomH488 wrote:
> > Both Mean and Standard Deviation are .40 of the training values.
> >
> > The Output is binary (-1 or +1).
> >
> > Symmetric Logistic used on the Hidden layer.
> > Tanh used on the Output layer.
>
> What is the difference between a symmetric logistic and tanh ?

Symmetric Logistic is like tanh but not as sharp (it's a logistic that is scaled to [-1, 1]).
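
If I have the formulas right, the symmetric logistic is 2/(1+exp(-x)) - 1, which works out to tanh(x/2), whereas tanh(x) = 2/(1+exp(-2x)) - 1; i.e. the same shape with the argument scaled by 2, hence "not as sharp".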

>
> > Early stopping with 20% Test Set.
>
> The validation subset is used for Early Stopping.
>
> The test subset is used to get an unbiased estimate on nontraining
> ( validation, test and unseen ) data
>
> > Just thinking now perhaps I should try Linear on the Output Layer?
>
> I don't think that is your problem.
>
> > I have no other ideas.
>
> [ I N ] = size(input) = 70
> [ O N ] = size(target) = 1
> H = number of hidden nodes = 40
> (Ntrn/Nval/Ntst)/N = (800/200/8)/1000
> Ntrneq = Ntrn*O = number of training equations = 800*1 = 800
> Nw = (I+1)*H +(H+1)*O = number of unknown weights 71*40 + 41*1 = 2881
> computer language = Delphi code which keystrokes NeuroShell 2 GUI
> How many random initial weight designs = averaging 6 trainings, each with a different randomly chosen Val set and randomly chosen IW's.
>
> Greg



NOTE: I have a LINEAR Output run running tonight. Hopefully it will complete without issue.





TomH488

Sep 23, 2015, 10:38:04 PM
On Wednesday, September 23, 2015 at 9:01:16 AM UTC-4, Stephen Wolstenholme wrote:
> On Tue, 22 Sep 2015 10:24:58 -0700 (PDT), TomH488
Sorry for the confusion!

Ward Systems (NeuroShell) calls the Val set the Test set, and the Test set the Production set.

I forgot to translate into the conventional names.

TomH488

Sep 24, 2015, 2:08:43 PM
I should have described this up front but was trying to avoid getting into details:

The behavior is NOT from a single training of a NNet or the average of 6 trainings with different VAL sets and IW's.

It is a COMPOSITE result from a backtest consisting of 125 such averaged trainings.

NOTE: The project is a weekly binary prediction [-1,1] of the movement of the stock market.

So the MEAN and STDEV I'm talking about are based on this composite result: 125 averages of 6 trainings w/random VAL sets and IW's.

So we are talking about a composite of single predictions, one week into the future, for each of 125 consecutive weeks.

So when I take the MEAN and STDEV of those 125 predictions and compare them with the actual values they were trained toward (this is a backtest, so everything is known), I get this discrepancy.
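
Roughly, the comparison I am making (MATLAB-style; the arrays below are only placeholders standing in for my backtest results):

  act  = sign(randn(1, 125));         % the 125 actual weekly outcomes (+/-1), placeholder
  pred = tanh(randn(1, 125));         % the 125 composite predictions, placeholder
  [ mean(pred) mean(act) ]            % with my real arrays, the prediction mean is ~0.40 of the actual
  [ std(pred)  std(act)  ]            % and the prediction SD is ~0.40 of the actual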

I suspect it is an issue of BIAS: the BIASes obtained by training on 20 years of historic data WILL NOT match the BIAS associated with 1 week into the future. In other words, I suspect this is an issue of nonstationarity.

TomH488

Sep 29, 2015, 8:12:32 AM
OBSERVATIONS of the MEAN:

When I compute the mean over all the backtest trainings, the model predicts high more often than it should.

If the model were poor, I would expect it to be nothing more than a random number generator with a mean equal to that of the training data.

But this is not the case - the model over-predicts.

OBSERVATIONS of the StDEV:

The individual trainings for a given date and initial weight seed do have a large SD. But when averaged, the SD is significantly reduced.

Actually this is what would be expected if the predictions were random: when you average noise signals, you end up with "nothing" - they cancel each other out.
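
A quick sanity check of that cancellation (MATLAB-style, synthetic noise only - not my data): averaging k independent zero-mean signals cuts the SD by roughly 1/sqrt(k).

  k = 6;                              % trainings averaged per date
  noise = randn(k, 100000);           % k independent zero-mean "predictions"
  sdSingle = mean(std(noise, 0, 2))   % typical SD of a single training (~1)
  sdAvg    = std(mean(noise, 1))      % SD after averaging the k trainings (~1/sqrt(k) = 0.41)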

Perhaps I am pulling some true signal out of the noise.

TomH488

Oct 13, 2015, 1:06:21 PM
On Wednesday, September 23, 2015 at 5:27:54 AM UTC-4, Greg Heath (alumni.brown.edu) wrote:
There is a single output for each training.

To get any kind of statistic, I'd have to

1) look at the composite result (the 125 single outputs from a BackTest of 125 weeks) or

2) look at a NeuroShell 2 Test Set for a given training (usually called the Validation Set EVERYWHERE ELSE! - that is SO irritating). For a BackTest there are (125 weeks) x (6 initial weight seeds) = 750 trainings and hence 750 Test Sets.

NOTE: I think that training beyond the min Test Set error would yield a better output average; however, the actual output error would increase. From what I understand about Regularization, it is applied individually to every weight, which should be much better than early stopping.

(Early Stopping will result in "undertrained" weights and some saturated ones. Training more will simply saturate more, which is again what is NOT needed. With NS2 not having regularization, perhaps MLP Backprop should be abandoned. I am going to look at PNN and GRNN again to see if better results can be obtained. NOTE: RBF is also not available in NS2.)
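
For what it's worth, one common form of the regularization I mean is weight decay, where a penalty on every individual weight is added to the error being minimized, e.g.

  E = mse(target - output) + lambda * sum(w.^2)

so every weight is continually pulled toward zero during training, rather than training simply being halted early.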