
Best number of hidden neurons - Specific example


lm...@wanadoo.fr

Feb 7, 2009, 4:45:30 PM
Hi everyone,

Thank you for reading my post.

Let me tell you about the "iris problem".

- We consider 150 iris flowers
described by four numerical variables X1, X2, X3 and X4.

- Each iris belongs to a certain class:
setosa (C1), versicolor (C2) or virginica (C3)
depending on the four variables above.

- Let us say that the first 130 irises will be used as
the training set (T) for the MLP we want to build...

- and the 20 remaining irises will be used for the
validation set (V).

What we want to do is build an MLP (Multi-layer perceptron) which
can classify a new iris (for which the class is unknown).

Suppose, as a start, that the MLP we want to build contains:
- 4 input neurons on its input layer (one for each variable X1, X2, X3 and X4),
- 3 output neurons on its output layer (one for each class C1, C2 and C3).

--------------------------------------------------------------------------------------------------------
The problem is that we are looking for the best number of hidden
neurons
considering that the network contains only one hidden layer.
--------------------------------------------------------------------------------------------------------

Do you think that the following approach is correct, and if not, can
you advise me:

1.1 build a MLP with 3 hidden neurons,
1.2 perform 10 training phases
and remember each one of the 10 MLP you got (MLP_3_SET)
and the performances of each MLP (MLP_PERF_3_SET),


2.1 build a MLP with 4 hidden neurons,
2.2 perform 10 training phases
and remember each one of the 10 MLP you got (MLP_4_SET)
and the performances of each MLP (MLP_PERF_4_SET),


...


8.1. build a MLP with 10 hidden neurons,
8.2. perform 10 training phases
and remember each one of the 10 MLP you got (MLP_10_SET)
and the performances of each MLP (MLP_PERF_10_SET).

At the end of the experiment you have 8 * 10 = 80 MLPs
(MLP_3_SET union MLP_4_SET union ... union MLP_10_SET)
and their corresponding performances
(MLP_PERF_3_SET union MLP_PERF_4_SET union ... union MLP_PERF_10_SET)

Select the MLP (M) for which the performances are the best
and get its number of hidden neurons (k).

Conclude that, for that specific problem (the "iris problem")
and that specific type of MLP (one input layer with four neurons, one
hidden layer and one output layer with three neurons) the best number
of hidden neurons is k.
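
For illustration, here is a minimal sketch of steps 1.1 through 8.2 in Python.
It assumes scikit-learn's MLPClassifier and its bundled iris data, and uses a
random 130/20 split rather than literally the first 130 flowers; none of this
is prescribed by the procedure above, any MLP library with random weight
initialization would do.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    # 130 training cases, 20 validation cases.
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=20,
                                                  random_state=0)

    results = {}   # H -> list of 10 validation accuracies (one per random init)
    for H in range(3, 11):                  # steps 1.1 ... 8.1
        perfs = []
        for trial in range(10):             # steps x.2: ten training phases
            mlp = MLPClassifier(hidden_layer_sizes=(H,), max_iter=2000,
                                random_state=trial)
            mlp.fit(X_trn, y_trn)
            perfs.append(mlp.score(X_val, y_val))
        results[H] = perfs

    k = max(results, key=lambda H: max(results[H]))   # H of the best MLP
    print("best H =", k, "validation accuracy =", max(results[k]))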

Please tell me what you think about it.
Thanks in advance for your help,
--
Lmhelp

Stephen Wolstenholme

Feb 7, 2009, 6:37:40 PM
The method you suggest is similar to the method I use in EasyNN-plus.
The number of hidden nodes starts at zero and then a node is added for
each test of learning for ten cycles (epochs). As nodes are added the
error reduces. Eventually the error increases when a node is added.
This stops the process and the previous network is selected. That
results in five hidden nodes for the Iris example and learning
completes in 833 cycles. It is not a perfect method because if the
number of hidden nodes is found using manual trial and error only
three hidden nodes are needed and learning is complete in 682 cycles.
It's easy to manually find the optimum number of hidden nodes with a
simple example like Iris, but more complex examples would take hours
using a manual process.
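
A rough sketch of that grow-until-the-error-rises loop (not EasyNN-plus code;
the scikit-learn models here are stand-ins for whatever trainer is used, and
the validation split is assumed to exist already):

    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    def validation_error(H, X_trn, y_trn, X_val, y_val, seed=0):
        # Train a one-hidden-layer MLP with H nodes (H = 0 falls back to a
        # linear model) and return its validation error rate.
        if H == 0:
            model = LogisticRegression(max_iter=1000)
        else:
            model = MLPClassifier(hidden_layer_sizes=(H,), max_iter=2000,
                                  random_state=seed)  # fresh weights each call
        model.fit(X_trn, y_trn)
        return 1.0 - model.score(X_val, y_val)

    def grow_hidden_layer(X_trn, y_trn, X_val, y_val, H_max=20):
        best_H = 0
        best_err = validation_error(0, X_trn, y_trn, X_val, y_val)
        for H in range(1, H_max + 1):
            err = validation_error(H, X_trn, y_trn, X_val, y_val)
            if err > best_err:      # error rose: keep the previous network
                break
            best_H, best_err = H, err
        return best_H, best_err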

Steve

--
Neural Planner Software Ltd http://www.NPSL1.com

Neural network applications for Windows

Tomasso

Feb 7, 2009, 9:23:28 PM
The iris data set is too small for your task, and MLPs don't give the strongest results. Outlier issues defeat them.

<lm...@wanadoo.fr> wrote in message news:9f341b4a-4776-4663...@m40g2000yqh.googlegroups.com...

Greg Heath

Feb 8, 2009, 2:36:49 PM
On Feb 7, 4:45 pm, lm...@wanadoo.fr wrote:
> Hi everyone,
>
> Thank you for reading my post.
>
> Let me tell you about the "iris problem".

-----SNIP

> - Let us say that the first 130 iris will be used as
> the training set (T) for the MLP we want to build...
>
> - and the 20 remaining iris will be used for the
> validation set (V).

There are 50 cases per class. It might be better to
use stratified (class balanced) random sampling by first
partitioning each class into design, validation and test
subsets. The class subsets can then be combined to
obtain the mixture training, validation and test sets.
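
For example, a stratified three-way split might look like this (a sketch using
scikit-learn's train_test_split with the stratify option, not Greg's code; the
subset sizes are arbitrary choices for illustration):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    # Carve out a class-balanced test set first, then split the remaining
    # design set into training and validation sets, again stratified by class.
    X_des, X_tst, y_des, y_tst = train_test_split(
        X, y, test_size=30, stratify=y, random_state=0)
    X_trn, X_val, y_trn, y_val = train_test_split(
        X_des, y_des, test_size=30, stratify=y_des, random_state=0)
    # Each subset now contains the three classes in equal proportion.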

> What we want to do is build a MLP (Multi-layer perceptron) which
> can classify a new iris (for which the class is unknown).
>
> Suppose, as a start, that the MLP we want to build contains:
> - 4 input neurons on its input layer (for each variable X1, X2, X3 and
> X4),
> - 3 output neurons on its output layer (for each class C1, C2 and C3).

With binary training targets defined by rows or columns of the unit
matrix.


> --------------------------------------------------------------------------
> The problem is that we are looking for the best number of hidden
> neurons considering that the network contains only one hidden layer.
> --------------------------------------------------------------------------
>

> Do you think that the following approach is correct and if not, can
> you advise me:

N = Ndes + Ntst
Ndes = Ntrn + Nval

You have no test set (Ntst = 0). All of your data is used for design
(training + validation). In order to obtain an unbiased estimate of
generalization error (i.e., error on non-design data) you can do one
of two things.

1. After H and other learning parameters are determined using
repeated trials of the design set, randomly repartition the data
into training and test sets (Nval = 0). Use repeated trials (via
multiple weight initializations) of the new training and test set to
obtain unbiased estimates of the average and standard deviation
of the generalization error.

2. Originally partition the data into design (training + validation) and
test sets. After H and other learning parameters are determined
using repeated trials of the design set, combine the training set
and validation set to form a new training set (optional). Use repeated
trials (via multiple weight initializations) of the (new) training and
test set to obtain unbiased estimates of the average and standard
deviation of the generalization error.
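
A sketch of the estimation loop both options end with: random Ntrn/Ntst
repartitions, each trained from several weight initializations (my
illustration using scikit-learn, not Greg's code; H = 3 and the subset sizes
are assumptions, not prescriptions):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    H = 3                        # assumed already chosen in the design phase
    errors = []
    for rep in range(10):        # random Ntrn/Ntst repartitions
        X_trn, X_tst, y_trn, y_tst = train_test_split(
            X, y, test_size=30, stratify=y, random_state=rep)
        for init in range(5):    # multiple weight initializations per split
            mlp = MLPClassifier(hidden_layer_sizes=(H,), max_iter=2000,
                                random_state=init)
            mlp.fit(X_trn, y_trn)
            errors.append(1.0 - mlp.score(X_tst, y_tst))

    print("test error: %.3f +/- %.3f" % (np.mean(errors), np.std(errors)))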

Ntst and Nval are chosen large enough to obtain sufficiently
precise error rate estimates. If you assume that the classification
errors are binomially distributed, the standard deviation of the
test set error rate is

std(etst) = sqrt(etst*(1-etst)/Ntst) <= 0.5/sqrt(Ntst).

Similarly for eval. Nval and Ntst can be chosen accordingly.
However, Nval doesn't have to be as large as Ntst to determine
sufficiently good parameters.
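
As a quick worked illustration (my numbers, not from the post), the 20-case
validation set proposed above gives only a coarse error estimate:

    import math
    Ntst = 20                      # e.g. the 20-case validation set above
    print(0.5 / math.sqrt(Ntst))   # ~0.112: the error rate is only known to
                                   # within roughly +/- 11 percentage points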

Ntrn should be chosen sufficiently large to obtain accurate weight
estimates. This is discussed in many previous posts. Search
in Google Groups using

greg-heath Neq Nw.

If N is not sufficiently large to satisfy all of the constraints,
either lower your requirements (not recommended until
other alternatives are exhausted) or consider

1. f-fold crossvalidation (e.g., 10-fold XVAL)
2. Overtraining mitigation (e.g., weight decay; See the
FAQ re overfitting)


> 1.1 build a MLP with 3 hidden neurons,
> 1.2 perform 10 training phases
> and remember each one of the 10 MLP you got (MLP_3_SET)
> and the performances of each MLP (MLP_PERF_3_SET),

...SNIP

> Select the MLP (M) for which the performances are the best
> and get its number of hidden neurons (k).

How do you define best?

> Conclude that, for that specific problem (the "iris problem")
> and that specific type of MLP (one input layer with four neurons, one
> hidden layer and one output layer with three neurons) the best number
> of hidden neurons is k.
>
> Please tell me what is your feeling about it.

OK for a small problem. However, for larger problems a binary
search for 1 <= Hopt <= Hmax(Neq/Nw) is much more efficient.

In either case it is always good to start with a linear classifier
(H = 0) as a baseline.

I like to plot avg +/- stdev/sqrt(Ntrials) to determine when I have
a sufficient number of trials.

Hope this helps.

Greg

Greg Heath

Feb 8, 2009, 2:59:23 PM
On Feb 7, 6:37 pm, Stephen Wolstenholme <st...@tropheus.demon.co.uk>
wrote:

> The method you suggest is similar to the method I use in EasyNN-plus.
> The number of hidden nodes starts at zero and then a node is added for
> each test of learning for ten cycles (epochs).

Murali Menon has shown that when adding nodes it is best to start with
a rerandomization of hidden weights.

> As nodes are added the
> error reduces. Eventually the error increases when a node is added.
> This stops the process and the previous network is selected.

I have often found that it is wise to go a few steps further after the
first increase. Sometimes the increase is sporadic and further training
will lead to additional decreases.

> That
> results in five hidden nodes for the Iris example and leaning
> completes in 833 cycles. It is not a perfect method because if the
> number of hidden nodes is found using manual trial and error only
> three hidden nodes are needed and learning is complete in 682 cycles.
> It's easy to manually find the optimum number of hidden nodes with
> simple example like Iris but more complex examples would take hours
> using a manual process.

I typically use three or more trials. For each trial I plot error
vs H and superimpose the Ntrial plots.

If I use a binary search for H, when err(2*H) > err(H) I check
err(fix(1.5*H)) and err(fix(2.5*H)) to corroborate overtraining.
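
A rough sketch of that kind of doubling search over H (my reading of the
description, not Greg's code; err(H) is a placeholder for any
train-and-validate routine returning the error for H hidden nodes):

    def doubling_search(err, H_start=1, H_cap=64):
        # err(H) -> validation error of a net with H hidden nodes.
        H = H_start
        best_H, best_err = H, err(H)
        while 2 * H <= H_cap:
            e2 = err(2 * H)
            if e2 > best_err:
                # Before stopping, corroborate with intermediate sizes,
                # since a single increase can be sporadic.
                for H_try in (int(1.5 * H), int(2.5 * H)):
                    e_try = err(H_try)
                    if e_try < best_err:
                        best_H, best_err = H_try, e_try
                break
            best_H, best_err = 2 * H, e2
            H *= 2
        return best_H, best_err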

Hope this helps.

Greg

Greg Heath

Feb 8, 2009, 3:18:01 PM
On Feb 7, 9:23 pm, "Tomasso" <Toma...@blank.blank> wrote:
> The iris data set is too small for your task, and MLPs don't
> give the strongest results. Outlier issues defeat them.

It is always wise to obtain summary statistics and visualization
of the classes before beginning a design. Class-conditional
PCA and clustering are very useful. Outlier detection is usually
not difficult. However, what to do with outliers can be very, very
frustrating. I toss or correct obvious errors. What to do with the
others is problem dependent.

I, too, have found that RBF or EBF models deal with outliers
better than MLPs.

Hope this helps.

Greg

Greg Heath

Feb 8, 2009, 3:35:41 PM
On Feb 8, 2:59 pm, Greg Heath <he...@alumni.brown.edu> wrote:
> On Feb 7, 6:37 pm, Stephen Wolstenholme <st...@tropheus.demon.co.uk>
> wrote:
>
> > The method you suggest is similar to the method I use in EasyNN-plus.
> > The number of hidden nodes starts at zero and then a node is added for
> > each test of learning for ten cycles (epochs).
>
> Murali Menon has shown that when adding nodes it is best to start with
> a rerandomization of hidden weights.

a reinitialization of all weights.

Gavin Cawley

Feb 8, 2009, 4:39:09 PM

AFAICS this procedure will not give an unbiased estimate, only an
estimate that is likely to be less biased than using the validation
set error. In each repetition, some information about the test set
has leaked into the design of the classifier through the prior
selection of H and "other learning parameters". Sometimes the
remaining bias will be small, but in some situations the bias can be
surprisingly large. Small datasets (such as iris) and large numbers
of "H and other parameters" are likely to make the bias large; with big
datasets and small numbers of parameters there is less of a
problem. A similar situation occurs in choosing scaling parameters
for RBF or kernel models (see my JMLR paper at
http://jmlr.csail.mit.edu/papers/v8/cawley07a.html), where if there
are a large number of hyper-parameters you can easily over-fit the
selection criterion. A similar problem also can arise if you use
feature selection based on the design set and then re-partition to get
an estimate of generalisation performance (see http://www.pnas.org/content/99/10/6562.abstract
for details).

The bottom line is that the estimator will only be unbiased if the
test data has not been used in ANY way to set or optimise ANY parameter
of the model, including choice of architecture. The safe thing to do
is nested cross-validation, or repeated resampling into training/
validation/test sets, where the validation set is used for choosing H,
early stopping, selection of weight decay parameters etc and then use
the ensemble of the models from each repetition to make predictions
(taking advantage of the additional protection against overfitting
this provides). This approach proved quite successful for the IJCNN
performance prediction challenge.
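
A sketch of that nested scheme (my illustration, not Gavin's code):
scikit-learn's GridSearchCV does the inner selection of H within each outer
fold, and the per-fold winners are kept as an ensemble.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, GridSearchCV
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    param_grid = {"hidden_layer_sizes": [(h,) for h in range(1, 11)]}

    outer_scores, ensemble = [], []
    for trn, tst in outer.split(X, y):
        # The inner CV re-selects H from scratch inside every outer fold, so
        # the outer test fold never influences any choice made for its model.
        inner = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                             param_grid, cv=3)
        inner.fit(X[trn], y[trn])
        outer_scores.append(inner.score(X[tst], y[tst]))
        ensemble.append(inner.best_estimator_)

    print("accuracy estimate: %.3f" % np.mean(outer_scores))
    # For new data, average or vote over the ensemble of per-fold models.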

HTH

Gavin

P.S. for small datasets such as this, kernel methods are much easier
to fit from a practical perspective as there is no problem with local
minima, and leave-one-out cross-validation is essentially for free, so
nested cross-validation schemes are very straightforward and
computationally fairly cheap. However the no-free-lunch theorems say
they won't necessarily work any better ;o)

lm...@wanadoo.fr

Feb 8, 2009, 5:33:28 PM
Thank you for all your answers.

---- To Stephen Wolstenholme:

Can you tell me what value you use to initialize
your number of cycles? 1000?

---- To Tomasso:

Ok.

---- To Greg Heath:

Sorry, not so easy for me to follow you :)

--
What is H?

--
> Lmhelp: Select the MLP (M) for which the performances are the best and get its number of hidden neurons (k).
> Greg Heath: How do you define best?

To me, the best MLP is the one which:
- given an input iris = (x1, x2, x3, x4)
- belonging "in reality" to the class Cx
- returns an output in {C1, C2, C3, C4}
which is "as frequently as possible" equal to Cx.
As the number of input and output neurons doesn't
change, what we look for is an optimal number of
hidden neurons.

--
I don't know what you mean about "binary search".

Thanks,

--
Lmhelp

Stephen Wolstenholme

Feb 8, 2009, 5:57:50 PM
On Sun, 8 Feb 2009 12:35:41 -0800 (PST), Greg Heath
<he...@alumni.brown.edu> wrote:

>On Feb 8, 2:59 pm, Greg Heath <he...@alumni.brown.edu> wrote:
>> On Feb 7, 6:37 pm, Stephen Wolstenholme <st...@tropheus.demon.co.uk>
>> wrote:
>>
>> > The method you suggest is similar to the method I use in EasyNN-plus.
>> > The number of hidden nodes starts at zero and then a node is added for
>> > each test of learning for ten cycles (epochs).
>>
>> Murali Menon has shown that when adding nodes it is best to start with
>> a rerandomization of hidden weights.
>
>a reinitalization of all weights.

That's what I do with every change in the number of hidden nodes.

Stephen Wolstenholme

Feb 8, 2009, 6:35:08 PM
On Sun, 8 Feb 2009 14:33:28 -0800 (PST), lm...@wanadoo.fr wrote:

>---- To Stephen Wolstenholme:
>
>Can you tell me what is the value you use to initialize
>your number of cycles? 1000?

The number of cycles is adjustable from 10 upwards. I find that 10
works well and using more does not usually make any difference.

Greg Heath

Feb 10, 2009, 8:01:35 AM

Right. I neglected to put in the caveat: given H and the learning
parameters that you have already determined, using the validation
set to estimate generalization error can result in a highly biased
result. To mitigate this, use one or more random repartitions
and with each repartition use multiple weight initializations.

> In each repetition, some information about the test set
> has leaked into the design of the classifier through the prior
> selection of H and "other learning parameters". Sometimes the
> remaining bias will be small, but in some situations the bias can be
> surprisingly large.

Bias primarily depends on the size of the training set. In previous
posts I have given a few heuristic guidelines for determining if Ntrn
is sufficiently large for training to convergence. If those guidelines
are used, I would be very surprised if a large bias was obtained.

As I stated previously, if N is not large enough to yield sufficiently
large values of Ntrn and Nval, then crossvalidation and/or
overtraining mitigation should be considered.

> Small datasets (such as iris) and large numbers
> of "H and other parameters" are likely to make the bias large, big
> datasets and small numbers of parameters and there is less of a
> problem.

That is why I suggested searching Google Groups with

greg-heath Neq Nw

> A similar situation ocurrs in choosing scaling parameters
> for RBF or kernel models (see my JMLR paper at
> http://jmlr.csail.mit.edu/papers/v8/cawley07a.html), where if there
> are a large number of hyper-parameters you can easily over-fit the
> selection criterion. A similar problem also can arise if you use
> feature selection based on the design set and then re-partition to get
> an estimate of generalisation performance (see
> http://www.pnas.org/content/99/10/6562.abstract
> for details).
>
> The bottom line is that the estimator will only be unbiased if the
> test data has not been used in ANY way to set optimise ANY parameter
> of the model, including choice of architecture.

PROVIDED the training set is sufficiently large to obtain
accurate weight estimates.

> The safe thing to do


> is nested cross-validation, or repeated resampling into training/
> validation/test sets, where the validation set is used for choosing H,
> early stopping, selection of weight decay parameters etc and then use
> the ensemble of the models from each repetition to make predictions
> (taking advantage of the additional protection against overfitting
> this provides).

This is the approach that I recommend for serious work.

However, the point I was addressing in (1) was what to
do if you have already made the mistake of using all of
the data to determine topology and corresponding
learning parameters.

If N can provide sufficiently large values for Ntrn and Ntst,
I doubt if large biases will result from random Ntrn/Ntst
repartitions.

> This approach proved quite successful for the IJCNN
> performance prediction challenge.

I'm not surprised.

> HTH

It does. My original advice was not sufficiently clear.

Thanks.

> P.S. for small datasets, such as this kernel methods are much easier
> to fit from a practical perspective as there is no problem with local
> minima, and leave-one-out cross-validation is essentially for free, so
> nested cross-validation schemes are very straightforward and
> computationally fairly cheap. However the no-free-lunch theorems say
> they wont necessarily work any better ;o)

P.P.S. I'll have to check your references later (... so little time ...)

Greg

Greg Heath

Feb 10, 2009, 8:47:32 AM
On Feb 8, 5:33 pm, lm...@wanadoo.fr wrote:
> Thank you for all your answers.
-----SNIP

> ---- To Greg Heath:
>
> Sorry, not so easy for me to follow you :)

Sorry, I have written about this so many times
in previous posts, I do tend to skip details
that would help. Searching in Google Groups
using greg-heath as one of the keywords will
usually clarify almost anything I write.

Otherwise, just do as you are doing now ...
ask.

> What is H?

Number of hidden nodes

> > Lmhelp: Select the MLP (M) for which the performances are the best and get its number of hidden neurons (k).
> > Greg Heath: How do you define best?
>
> To me, the best MLP is the one which:
> - given an input iris = (x1, x2, x3, x4)
> - belonging "in reality" to the class Cx
> - returns an output in {C1, C2, C3, C4}
> which is "as frequently as possible" equal to Cx.

There are only three classes.

> As the number of input and output neurons doesn't
> change, what we look for is an optimal number of
> hidden neurons.

You missed my point. For each H candidate you will
have multiple runs based on multiple weight
initializations. Each run will have three
class-conditional error rates and one mixture error
rate. Therefore, when the smoke clears you will have
four distributions of error rates for each H candidate.
Given all of this, what summary statistic do you use to
determine that H2 is better than H1?

How would this change if the classes were of different
sizes (e.g., 150 = 25 + 50 + 75)?
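
To make the bookkeeping concrete, here is a small sketch (mine, not Greg's)
that computes the three class-conditional error rates and the mixture error
rate for one run from its confusion matrix:

    import numpy as np

    def error_rates(y_true, y_pred, n_classes=3):
        # Confusion matrix: rows = true class, columns = predicted class.
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        class_errors = 1.0 - np.diag(cm) / cm.sum(axis=1)  # one per class
        mixture_error = 1.0 - np.trace(cm) / cm.sum()      # overall rate
        return class_errors, mixture_error

Repeating this over all runs for each candidate H gives the four error-rate
distributions described above; some summary statistic (mean mixture error,
worst class-conditional error, etc.) still has to be chosen to rank the H
values, which is the point of the question.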

> I don't know what you mean about "binary search".

http://mathworld.wolfram.com/BinarySearch.html

en.wikipedia.org/wiki/Binary_search

Start with Hmin = 1 and Hmax(Neq/Nw) as explained
in previous posts. Search with

greg-heath Neq Nw.

If you are persistent,

greg-heath binary-search

might help.

Hope this helps

Greg

Gavin Cawley

Feb 10, 2009, 2:24:40 PM

I think we must be talking at cross-purposes (plus ça change! ;o), as
more random repartitions will reduce the variance of the estimator but
will have no effect on the bias.

> > In each repetition, some information about the test set
> > has leaked into the design of the classifier through the prior
> > selection of H and "other learning parameters". Sometimes the
> > remaining bias will be small, but in some situations the bias can be
> > surprisingly large.
>
> Bias primarily depends on the size of the training set.

No, if a method for generating a performance estimate is statistically
unbiased, it will give an unbiased estimate regardless of the size of the
dataset. The size of the dataset will however affect the variance of
the estimator.

> In previous
> posts I have given a few heuristic guidelines for determining if Ntrn
> is sufficiently large for training to convergence. If those
> guidelines
> are used, I would be very surprised if a large bias was obtained.
>
> As I stated previously, if N is not large enough to yield
> sufficiently
> large values of Ntrn and Nval, then crossvalidation and/or
> overtraining mitigation should be considered.

yes, but that is separate to the issue of whether the estimate of
generalisation performance is unbiased.

> > Small datasets (such as iris) and large numbers
> > of "H and other parameters" are likely to make the bias large, big
> > datasets and small numbers of parameters and there is less of a
> > problem.
>
> That is why I suggested searching Google Groups with
>
> greg-heath Neq Nw
>
> > A similar situation ocurrs in choosing scaling parameters
> > for RBF or kernel models (see my JMLR paper at
> >http://jmlr.csail.mit.edu/papers/v8/cawley07a.html), where if there
> > are a large number of hyper-parameters you can easily over-fit the
> > selection criterion. A similar problem also can arise if you use
> > feature selection based on the design set and then re-partition to get
> > an estimate of generalisation performance (see
> > http://www.pnas.org/content/99/10/6562.abstract
> > for details).
>
> > The bottom line is that the estimator will only be unbiased if the
> > test data has not been used in ANY way to set optimise ANY parameter
> > of the model, including choice of architecture.
>
> PROVIDED the training set is sufficiently large to obtain
> accurate weight estimates.

no, the statement I made above holds true whether or not the training
set is large enough to obtain accurate weight estimates. The problem
is to do with the possibility of over-fitting the model selection
criterion (used to select H etc) not overfitting the training
criterion. This is a subtle issue.

> > The safe thing to do
> > is nested cross-validation, or repeated resampling into training/
> > validation/test sets, where the validation set is used for choosing H,
> > early stopping, selection of weight decay parameters etc and then use
> > the ensemble of the models from each repetition to make predictions
> > (taking advantage of the additional protection against overfitting
> > this provides).
>
> This is the approach that I recommend for serious work.
>
> However, the point I was addressing in (1) was what to
> do if you have already made the mistake of using all of
> the data to determine topology and corresponding
> learning parameters.

IMHO it is better to scrap the model you have and go back and perform
the experiment again properly, "better to get a good answer slowly
than a bad answer quickly" is my maxim ;o).

> If N can provide sufficiently large values for Ntrn and Ntst,
> I doubt if large biases will result from random Ntrn/Ntst
> repartitions.

if you said variances rather than biases, I would agree with you.

> > This approach proved quite successful for the IJCNN
> > performance prediction challenge.
>
> I'm not surprised.

my models were not the most accurate, but IIRC my performance
estimates were. Nested cross-validation compared to less proper
estimates made a lot of difference.

> > HTH
>
> It does. My original advice was not sufficiently clear.
>
> Thanks.
>
> > P.S. for small datasets, such as this kernel methods are much easier
> > to fit from a practical perspective as there is no problem with local
> > minima, and leave-one-out cross-validation is essentially for free, so
> > nested cross-validation schemes are very straightforward and
> > computationally fairly cheap. However the no-free-lunch theorems say
> > they wont necessarily work any better ;o)
>
> P.P.S I'll have to check your references later ( ...so little
> time ...)

The one in PNAS is especially relevant (I think it is a classic that
all machine learning and NN people ought to read); mine is rather more
tangential, as it explains the problem that makes a bias possible in
(1), but it isn't specifically about bias in performance estimates per
se. There is a bit of discussion in my conference paper from the
IJCNN competition though.

lm...@wanadoo.fr

Feb 10, 2009, 6:51:00 PM
Ok, thank you for your answers.
I'll give myself a little time to think about all the interesting
things you wrote.

Greg Heath

Feb 12, 2009, 10:57:31 AM
On Feb 10, 2:24 pm, Gavin Cawley <GavinCaw...@googlemail.com> wrote:
> On 10 Feb, 13:01, Greg Heath <he...@alumni.brown.edu> wrote:
> > On Feb 8, 4:39 pm, Gavin Cawley <GavinCaw...@googlemail.com> wrote:
> > > On 8 Feb, 19:36, Greg Heath <he...@alumni.brown.edu> wrote:
> > > > On Feb 7, 4:45 pm, lm...@wanadoo.fr wrote:
-----SNIP

The bias will not be removed. However, I thought that given the
original mistake, this could reduce the bias in addition to reducing
the variance.

Once you have made the mistake of using all of your data for design
via repeated applications of the same training and validation sets,
using the same Ntrn/Nval partition for testing will obviously result
in a large bias.

How, now, to obtain the best generalization error estimate?
Since you already know H and the learning parameters, the
horse is out of the barn. Going back and using a three-way
Ntrn/Nval/Ntst partition doesn't make any sense. It also doesn't
make any sense to just use one more two-way partition (Ntrn/Ntst)
with multiple weight initializations.

Therefore, the best way to mitigate the original mistake, is to use
multiple random Ntrn/Ntst repartitions and with each repartition
use multiple weight initializations.

My claim is that, given the original mistake, this is one way to
try to obtain the best possible generalization estimate.

Therefore, the bias is mitigated, not necessarily eliminated.

> > > In each repetition, some information about the test set
> > > has leaked into the design of the classifier through the prior
> > > selection of H and "other learning parameters". Sometimes the
> > > remaining bias will be small, but in some situations the bias can be
> > > surprisingly large.
>
> > Bias primarily depends on the size of the training set.
>
> No, if a method for generating a performance estimate is statistically
> unbiased, it will give unbiased estimate regardless of the size of the
> dataset. The size of the dataset will however affect the variance of
> the estimator.

Fukunaga (1970?,199?) shows that bias depends on
the size of Ntrn and variance depends on the size of Ntst.

I guess the best way to prove this to yourself would be to begin
with Ntrn and Ntst at sufficiently large values. Then freeze one
and decrease the other.

> > In previous
> > posts I have given a few heuristic guidelines for determining if
> > Ntrn is sufficiently large for training to convergence. If those
> > guidelines are used, I would be very surprised if a large bias
> > was obtained.
>
> > As I stated previously, if N is not large enough to yield
> > sufficiently large values of Ntrn and Nval, then crossvalidation
> > and/or overtraining mitigation should be considered.
>
> yes, but that is separate to the issue of whether the estimate of
> generalisation performance is unbiased.

Since Bias depends on Ntrn, I disagree.

> > > Small datasets (such as iris) and large numbers
> > > of "H and other parameters" are likely to make the bias large, big
> > > datasets and small numbers of parameters and there is less of a
> > > problem.

That is what I said: Bias depends on Ntrn.

Are you contradicting yourself or are we miscommunicating?

-----SNIP


>
> > > The bottom line is that the estimator will only be unbiased if the
> > > test data has not been used in ANY way to set optimise ANY parameter
> > > of the model, including choice of architecture.
>
> > PROVIDED the training set is sufficiently large to obtain
> > accurate weight estimates.
>
> no, the statement I made above holds true whether or not the training
> set is large enough to obtain accurate weight estimates. The problem
> is to do with the possibility of over-fitting the model selection
> criterion (used to select H etc) not overfitting the training
> criterion. This is a subtle issue.

Again, the difference over whether bias depends on Ntrn.

-----SNIP

> > However, the point I was addressing in (1) was what to
> > do if you have already made the mistake of using all of
> > the data to determine topology and corresponding
> > learning parameters.
>
> IMHO it is better to scrap the model you have and go back and perform
> the experiment again properly, "better to get a good answer slowly
> than a bad answer quickly" is my maxim ;o).

I agree. That was the second option I proposed. However,
I failed to make a statement that it would be better to start
over again. I was in the mode of how to mitigate the mistake.

> > If N can provide sufficiently large values for Ntrn and Ntst,
> > I doubt if large biases will result from random Ntrn/Ntst
> > repartitions.
>
> if you said variances rather than biases, I would agree with you.
>
> > > This approach proved quite successful for the IJCNN
> > > performance prediction challenge.
>
> > I'm not surprised.
>
> my models were not the most accurate, but IIRC my performance
> estimates were. Nested cross-validation compared to less proper
> estimates made a lot of difference.

I'm glad you made that point. The OP was only concerned with getting
the right topology. I addressed given that topology, how to get the
best generalization error estimate.

Unfortunately, I no longer have the books (e.g., Fukunaga...).
In addition, family matters are keeping me away from home and
my MATLAB software. Otherwise I would try to demonstrate the
dependence of bias on Ntrn.

Hope this helps.

Greg

Gavin Cawley

Feb 13, 2009, 7:27:37 AM

The point I was making was that method 1 doesn't give an unbiased
estimate as you initially suggested, just a less biased one. We now
seem in agreement. Whether the remaining bias is negligible depends
on the specifics of the example. It is important not to say the
estimator is unbiased, as that suggests it is an unconditionally
safe procedure, which could lead to a pitfall for the unwary.

> > > > In each repetition, some information about the test set
> > > > has leaked into the design of the classifier through the prior
> > > > selection of H and "other learning parameters". Sometimes the
> > > > remaining bias will be small, but in some situations the bias can be
> > > > surprisingly large.
>
> > > Bias primarily depends on the size of the training set.
>
> > No, if a method for generating a performance estimate is statistically
> > unbiased, it will give unbiased estimate regardless of the size of the
> > dataset. The size of the dataset will however affect the variance of
> > the estimator.
>
> Fukunaga (1970?,199?) shows that bias depends on
> the size of Ntrn and variance depends on the size of Ntst.
>
> I guess the best way to prove this to yourself would be to begin
> with Ntrn and Ntst at sufficiently large values. Then freeze one
> and decrease the other.

I don't have Fukunaga to hand either, but I will check. I think it
may depend on what the estimator claims to be. For instance LOOCV is
an almost unbiased estimate of the performance of the classifier
trained on all of the available data, but IIRC it is an unbiased
estimator of performance on a dataset one example smaller (hence the
"almost unbiased").

The point I was making is that if you optimise the hyper-parameters
(e.g. choice of H) on the whole dataset then even an (almost) unbiased
method such as LOOCV will no longer be (almost) unbiased if you use
the same hyper-parameter settings in each fold. The reason is that
you are not repeating in each fold the whole procedure used to fit the
model (i.e. model selection and model fitting). The size of that bias
can be surprisingly large and, unless you do a direct comparison, you
will be unaware of its magnitude.

In other words (1) would make even a completely unbiased estimator
become biased via selection bias, whether the particular (resampling)
estimator was unbiased to begin with is secondary to the point I
wanted to make.

> > > In previous
> > > posts I have given a few heuristic guidelines for determining if
> > > Ntrn is sufficiently large for training to convergence. If those
> > > guidelines are used, I would be very surprised if a large bias
> > > was obtained.
>
> > > As I stated previously, if N is not large enough to yield
> > > sufficiently large values of Ntrn and Nval, then crossvalidation
> > > and/or overtraining mitigation should be considered.
>
> > yes, but that is separate to the issue of whether the estimate of
> > generalisation performance is unbiased.
>
> Since Bias depends on Ntrn, I disagree.
>
> > > > Small datasets (such as iris) and large numbers
> > > > of "H and other parameters" are likely to make the bias large, big
> > > > datasets and small numbers of parameters and there is less of a
> > > > problem.
>
> That is what I said: Bias depends on Ntrn.
>
> Are you contradicting yourself or are we miscommunicating?

the latter (see above)

That is at the heart of the problem: you fundamentally can no longer
get an unbiased performance estimator for that topology, as the entire
dataset has been used to select the topology; this is a form of
selection bias. This is directly analogous to the feature selection
bias discussed in the PNAS paper I mentioned.

> Unfortunately, I have no longer have books (e.g., Fukunaga...).
> In addition, family matters are keeping me away from home and
> my MATLAB software. Otherwise I would try to demonstrate the
> dependence of bias on Ntrn.

No problem, as I mentioned earlier I think I have seen the source of
the miscommunication and hope I have explained my problem with (1)
rather better now. Basically I wanted to apply a caveat to some very
reasonable advice that had been unintentionally over-stated as being
"unbiased".

I hope the family matters can be resolved in a satisfactory (or
better) manner!

Greg Heath

Feb 13, 2009, 11:16:59 PM

Yes, I did not make myself clear. Once all of the data is repeatedly
used to determine the topology and learning parameters, any result
will be biased. The only thing left is to mitigate the effect of the
bias and try to obtain as accurate and precise a generalization
estimate as possible (which, by the way, did not seem to be the
goal of the OP).

> We now
> seem in agreement. Whether the remaining bias is negligible depends
> on the specifics of the example. It is important not to say the
> estimator is unbiased as it suggests that it is an unconditionally
> safe procedure, which could lead to a pitfall for the uwary.
>
> > > > > In each repetition, some information about the test set
> > > > > has leaked into the design of the classifier through the prior
> > > > > selection of H and "other learning parameters". Sometimes the
> > > > > remaining bias will be small, but in some situations the bias can be
> > > > > surprisingly large.
>
> > > > Bias primarily depends on the size of the training set.
>
> > > No, if a method for generating a performance estimate is statistically
> > > unbiased, it will give unbiased estimate regardless of the size of the
> > > dataset. The size of the dataset will however affect the variance of
> > > the estimator.
>
> > Fukunaga (1970?,199?) shows that bias depends on
> > the size of Ntrn and variance depends on the size of Ntst.
>
> > I guess the best way to prove this to yourself would be to begin
> > with Ntrn and Ntst at sufficiently large values. Then freeze one
> > and decrease the other.
>
> I don't have Fukunaga to hand either, but I will check. I think it
> may depend on what the estimator claims to be. For instance LOOCV is
> an almost unbiased estimate of the performance of the classifier
> trained on all of the available data,

Agree.

> but IIRC it is an unbiased
> estimator of performance on a dataset one example smaller (hence the
> "almost unbiased").

Are you referring to Jacknife estimation?

> The point I was making is that if you optimise the hyper-parameters
> (e.g. choice of H) on the whole dataset then even an (almost) unbiased
> method such as LOOCV will no longer be (almost) unbiased if you use
> the same hyper-parameter settings in each fold. The reason is that
> you are not repeating in each fold the whole procedure used to fit the
> model (i.e. model selection and model fitting).

Won't using separately tailored parameters for each fold increase the
bias?

Thanks.

Greg

Gavin Cawley

Feb 14, 2009, 11:18:45 AM
On 14 Feb, 04:16, Greg Heath <he...@alumni.brown.edu> wrote:
> On Feb 13, 7:27 am, Gavin Cawley <GavinCaw...@googlemail.com> wrote:

[snip snip snip]

> > I don't have Fukunaga to hand either, but I will check. I think it
> > may depend on what the estimator claims to be. For instance LOOCV is
> > an almost unbiased estimate of the performance of the classifier
> > trained on all of the available data,
>
> Agree.
>
> > but IIRC it is an unbiased
> > estimator of performance on a dataset one example smaller (hence the
> > "almost unbiased").
>
> Are you referring to Jacknife estimation?

it is the same sort of mechanism, yes.

> > The point I was making is that if you optimise the hyper-parameters
> > (e.g. choice of H) on the whole dataset then even an (almost) unbiased
> > method such as LOOCV will no longer be (almost) unbiased if you use
> > the same hyper-parameter settings in each fold. The reason is that
> > you are not repeating in each fold the whole procedure used to fit the
> > model (i.e. model selection and model fitting).
>
> Won't using separately tailored parameters for each fold increase the
> bias?

No, because it is a more realistic representation of the full
procedure used to fit the model, so if you don't tune the parameters
separately in each fold you are not estimating the performance of the
method used to fit the initial model, but of a slightly different one
with some additional expert knowledge (and hence it will generally
give overly optimistic performance estimates).

Another way to think of it is that if you use parameters tailored to the
whole dataset, they are as a result to some extent tailored to the
patterns forming the test set in every fold of the cross-validation
procedure, in which case you would expect an optimistic performance
estimate.

Greg Heath

Feb 15, 2009, 6:05:07 AM
On Feb 14, 11:18 am, Gavin Cawley <GavinCaw...@googlemail.com> wrote:

> On 14 Feb, 04:16, Greg Heath <he...@alumni.brown.edu> wrote:
> > On Feb 13, 7:27 am, Gavin Cawley <GavinCaw...@googlemail.com> wrote:

-----SNIP

> > > The point I was making is that if you optimise the hyper-parameters
> > > (e.g. choice of H) on the whole dataset then even an (almost) unbiased
> > > method such as LOOCV will no longer be (almost) unbiased if you use
> > > the same hyper-parameter settings in each fold. The reason is that
> > > you are not repeating in each fold the whole procedure used to fit the
> > > model (i.e. model selection and model fitting).
>
> > Won't using separately tailored parameters for each fold increase the
> > bias?
>

> No, because it is a more realistic representation of the full
> procedure used to fit the model, so if you don't tune the parameters
> separately in each fold you are not estimating the performance of the
> method used to fit the initial model, but a slightly different one
> with some additional expert knowledge (and hence it will generally
> give overly optimisitic performance estimates).
>
> Another way to think of is is if you use parameters tailored to the
> whole dataset they are as a result to some extent tailored to the
> patterns forming the test set in every fold of the cross-validation
> procedure, in which case you would expect an optimistic performance
> estimate.

No, I would expect more optimistic error rates from "tuning" each
fold model to fit the data in the fold subset than if the "tuning"
were global.

Hope this helps.

Greg


Gavin Cawley

Feb 15, 2009, 7:27:00 AM

The reason it gives less bias is because you will end up over-tuning
the parameters (unless you take effective steps to avoid that - see my
JMLR paper) to the design set in each fold, making the test
performance worse not better. This may seem counter-intuitive, but I
can assure you from personal experience (from the performance
prediction challenge) that it does happen.

Gavin...@googlemail.com

Mar 6, 2009, 5:45:41 AM

I have some results in for this question as well: on all of the
datasets I investigated, the error rates were more (in some cases
extremely) optimistic for global tuning than they were for separate
tuning in each fold (e.g. nested cross-validation), as predicted. The
differences are roughly commensurate with the differences in
performance between learning algorithms, which shows the bias is of
practical concern!

Greg Heath

Mar 6, 2009, 7:26:35 AM
On Mar 6, 5:45 am, "GavinCaw...@googlemail.com" wrote:
> ...which shows the bias is of practical concern!

You seem to have several results that are counterintuitive
and not well known. If so, you should think about disseminating them.

Perhaps next week I will have time to review and think about this
some more.

Greg

Gavin...@googlemail.com

Mar 6, 2009, 7:47:23 AM

This work was for a paper I am writing, so it will be disseminated
when I have a coherent story and recommendations. Again, I'd be
happy to send you a draft for comment once I have put it together (I
find explaining things unambiguously can be rather tricky and I'd
value comments from an expert with differing intuition!).

Greg Heath

Mar 6, 2009, 7:47:56 AM

P.S. Are you using real-world data sets or simulations?

Greg

Gavin...@googlemail.com

Mar 6, 2009, 8:34:27 AM

a mixture of well known benchmark datasets, mostly real-world, but a
few synthetic ones as well. I use Ripley's synthetic benchmark for
most of the initial exposition, but the results section is based on
more meaningful benchmarks (not iris! ;o).
