DOF = N - (p+1) = N - Nw
where Nw is the number of estimated weights?
How is the concept and equation modified for
multiple nonlinear regression?
How are the concepts and equations modified
for multivariate linear regression and
multivariate nonlinear regression?
Please post replies to all three newsgroups.
TIA,
Greg Heath
Degrees of freedom in linear modelling and GLMs (exponential-family modelling)
are an outcome of maximising the likelihood function. The count happens to be N-(p+1),
but that's because the maths works out that way, not because of some a priori reasoning
about the number of parameters.
In linear models and GLMs, the parameters are extensive. This means that any
change to a parameter has a "same order" effect on most values in the dependent
variable(s) space. E.g., for z = ax+by+c, if a is changed to a+1, then z changes
appreciably over essentially the whole (x,y) space. For neural networks, the transfer
function causes the parameters to be intensive. This means that if you change a
parameter then SOME parts of the dependent variable space experience a "same
order" effect, but much of the space (many predictions) is only slightly affected.
For typical FFNs, this amounts to sensitivity in the neighbourhood of linear
sub-manifolds about the fan-in of each neuron. In RBFs and SVMs, it is the
neighbourhood of the RBF centres or support vectors.
To get a sense of this, build a neural model, set up an evaluation set (eg, 100 points
in the input space), and examine the variation of predictions across this evaluation
set when you change ONE of the weights in the NN. You will observe that a few
points change a lot and many don't change much. Individual weights contribute
a lot to some predictions, and hardly at all to others.
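A minimal sketch of that experiment in Python/NumPy (an untrained random
network is enough to see the effect; the 2-10-1 architecture, the evaluation
range and the size of the perturbation are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# A small 2-10-1 MLP with tanh hidden units and random weights
W1 = rng.normal(size=(10, 2)); b1 = rng.normal(size=10)   # hidden layer
w2 = rng.normal(size=10);      b2 = rng.normal()          # output layer

def predict(X, W1, b1, w2, b2):
    return np.tanh(X @ W1.T + b1) @ w2 + b2

# Evaluation set: 100 points in the input space
X = rng.uniform(-3, 3, size=(100, 2))
y0 = predict(X, W1, b1, w2, b2)

# Perturb ONE hidden weight and look at how each prediction moves
W1p = W1.copy()
W1p[3, 0] += 0.5
dy = np.abs(predict(X, W1p, b1, w2, b2) - y0)

print("largest change :", dy.max())
print("median change  :", np.median(dy))
print("points with |change| < 1% of the largest:", np.sum(dy < 0.01 * dy.max()))

Typically a handful of evaluation points (those near the active region of the
perturbed neuron) move a lot, while many barely move at all.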
This is a result of NNs being a composition of logistic models, optimised over an
MSE (or something similar). The composition gives NNs their desirable non-
linearity and flexibility. It also makes weights "work hard" for less than the
whole model space.
This can be summarised as: in linear (and similar) modelling, parameters (weights)
have generalised roles, but in NNs (and similar), weights have specialised roles.
For useful discussion, see Ingrassia and Morlini, e.g.,
www.economia.unict.it/ingrassia/publ/IngrassiaMorliniGfkl06.pdf
To try to relate this to Greg's formula is futile. The "freedom" in an NN depends
on how it came to be constructed, which is a matter of the data and the architecture
and the learning algorithm and the initialisation.
An extreme version of this issue is to try to model a fairly complex problem with
a simple neural architecture and then partition the training data into points well
modelled and points poorly modelled. Train again on the well modelled data
only, and build a second NN on the poorly modelled data. Combine these
models with a boosting (or gating) approach.
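One possible reading of that recipe, sketched with scikit-learn (the dataset,
the deliberately small (3,) architecture, the median-residual split and the
logistic gate are all arbitrary illustrative choices, not a prescription):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# A fairly complex 1-D target for a deliberately simple architecture
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * np.sign(X[:, 0]) + 0.05 * rng.normal(size=400)

small = dict(hidden_layer_sizes=(3,), max_iter=5000, random_state=0)

# 1. Train a simple NN on everything, then split the data by residual size
nn0 = MLPRegressor(**small).fit(X, y)
resid = np.abs(y - nn0.predict(X))
well = resid <= np.median(resid)      # well-modelled half
poorly = ~well                        # poorly-modelled half

# 2. Retrain on the well-modelled points; fit a second NN to the rest
nn_well = MLPRegressor(**small).fit(X[well], y[well])
nn_poor = MLPRegressor(**small).fit(X[poorly], y[poorly])

# 3. Gate: a classifier estimates which expert is responsible for each input
gate = LogisticRegression().fit(X, poorly.astype(int))
p_poor = gate.predict_proba(X)[:, 1]
y_hat = (1 - p_poor) * nn_well.predict(X) + p_poor * nn_poor.predict(X)

print("MSE, single small NN:", np.mean((y - nn0.predict(X)) ** 2))
print("MSE, gated pair     :", np.mean((y - y_hat) ** 2))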
I may be able to find some more references, but I'm pretty busy...
Meno.
PS: People inclined to base their "theories of NN design" on terms like "degrees
of freedom" need to read a bit more of the theory of what DOF is about. NN flexibility
is basically a soft version of partitioning a model space.
You may want to consider Generalized Degrees of Freedom,
for which the basic reference is
Ye, J. (1998). On measuring and correcting the effects of data mining
and model selection. Journal of the American Statistical Association,
93, 120–131.
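Ye's GDF is, roughly, the total sensitivity of the fitted values to the
observations, GDF = sum_i d(yhat_i)/d(y_i); for a linear smoother yhat = H*y
this is just trace(H), i.e. p+1 for ordinary least squares. A minimal sketch
of the perturbation estimate, shown on the linear case as a sanity check (the
design, noise and step size are arbitrary; for an NN, fit() would retrain the
network instead):

import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1) design
y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

def fit(y):
    # return fitted values for a given response vector (here: ordinary LS)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ b

# Perturbation estimate of GDF = sum_i d(yhat_i)/d(y_i)
eps = 1e-4
yhat = fit(y)
gdf = sum((fit(y + eps * np.eye(N)[i])[i] - yhat[i]) / eps for i in range(N))

print("perturbation GDF       :", gdf)                           # close to p+1 = 5
print("trace of the hat matrix:", np.trace(X @ np.linalg.pinv(X)))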
"Degrees of Freedom" have at least two roles:
(i) use in bias adjustment for estimating the error variance;
(ii) in standard distributions used for significance tests.
The modifications required for these two roles are different.
For a general approach, which also allows for dependence in the modelling errors (which also has an effect), you might see the following paper of mine (although degrees of freedom are not explicitly treated as such):
Jones DA (1983) "Statistical analysis of empirical models fitted by optimisation". Biometrika, 70(1), 67-88.
David Jones
Is it because: my fitted line can use up one of the observations (the one
corresponding to one observation of y and one of x),
and the rest of the observations will be spread around it? But one observation
of x, and the corresponding observation of y, is not enough to define a
line. Then why do I subtract only one observation from the total number of
observations? Should I subtract two from the total number of observations?
(Here, by one observation I mean one value of y and one of x.)
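For the straight-line case being asked about here, the same counting as in the
original formula applies (just the p = 1 special case, spelled out):

  y = b0 + b1*x  has  Nw = p+1 = 2  estimated parameters (intercept and slope),
  so  DOF = N - (p+1) = N - 2,

i.e. two observations pin the line down and the remaining N-2 are "free".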
Their earlier paper in Technometrics looks like a good place to start
(also available from http://www.economia.unict.it/ingrassia/research.htm).
I've only skimmed through it, but it looks like an approach worth
trying in addition to more conventional approaches (at least
conventional within the neural network community).
That is not a definition.
Hope this helps.
Greg
-----SNIP
But how is it defined?
-----SNIP
Hope this helps.
Greg
I was (foolishly?) looking for something like:
In multiple linear regression, the information
(currently undefined) in N = Nw = p+1
independent observations of the form ([x0(=1),
x1, x2, ...xp], y) is sufficient for finding
a unique solution, b, to the linear regression
equation
X*b = Y,
size(X) = [N p+1],
size(b) = [p+1 1],
size(Y) = [N 1].
However, if N = Nw + (N-Nw) > Nw, then the
information in Nw observations is sufficient
for obtaining a solution and the remaining
N-Nw observations are "free" for satisfying
auxiliary constraints. Therefore DOF = N-Nw.
So, DOF is a measure of the difference between
the total information available from observations
and the minimal information needed for a unique
solution.
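A quick numerical illustration of the bias-adjustment role of that count (a
Python/NumPy sketch; the sizes, coefficients and noise level are arbitrary):
dividing the residual sum of squares by N-Nw, rather than by N, recovers the
true noise variance on average.

import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 3
Nw = p + 1                     # number of estimated weights, incl. the intercept
sigma2 = 2.0                   # true noise variance

X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
b_true = rng.normal(size=Nw)

ratio_N, ratio_DOF = [], []
for _ in range(2000):
    y = X @ b_true + rng.normal(scale=np.sqrt(sigma2), size=N)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    ratio_N.append(rss / N / sigma2)           # biased low
    ratio_DOF.append(rss / (N - Nw) / sigma2)  # about 1 on average

print("mean RSS/N      / sigma2:", np.mean(ratio_N))    # about (N-Nw)/N = 0.87
print("mean RSS/(N-Nw) / sigma2:", np.mean(ratio_DOF))  # about 1.0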
The extension to multivariate linear regression
would be
[x0,x1,...xp,y1,y2,...ym],
X*B = Y,
size(X) = [N p+1]
size(B) = [p+1 m]
size(Y) = [N m]
and
DOF = N*m - m*(p+1)
Am I way off base?
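Checking that count numerically in the same spirit (again just a sketch with
arbitrary sizes): fit the m responses column by column and divide the pooled
residual sum of squares by N*m - m*(p+1).

import numpy as np

rng = np.random.default_rng(0)
N, p, m = 40, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
B_true = rng.normal(size=(p + 1, m))

vals = []
for _ in range(2000):
    Y = X @ B_true + rng.normal(size=(N, m))      # unit noise variance
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)     # fits all m columns at once
    rss = np.sum((Y - X @ B) ** 2)
    vals.append(rss / (N * m - m * (p + 1)))

print("mean RSS / (N*m - m*(p+1)):", np.mean(vals))   # about 1.0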
Hope this helps.
Greg
Not sure why I didn't search here first,
but
http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
is along the line of what I was thinking.
Hope this helps.
Greg
There are two "it"s, neither of which would necessary be defined as being a "degrees of freedom"
For the case of bias correction "it" is related to the trace of the product of two matrices: this may be closest to a "degrees of freedom".
For the case of the distribution of test statistics, the whole distribution may change, going from a chi-squared to the distrib of a weighted sum of squared normals.
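Spelling both points out for the linear-smoother case yhat = H*y with
independent, homoscedastic errors of variance s2 (standard results, not
specific to the paper above):

  E[RSS] = E[ ||y - H*y||^2 ]
         = s2 * trace( (I-H)'(I-H) ) + bias terms
         = s2 * ( N - trace(2*H - H*H') ) + bias terms,

so the bias-correction "DOF" is N - trace(2*H - H*H'), which collapses to
N - trace(H) = N - (p+1) when H is the usual least-squares projection.

For test statistics, a quadratic form y'*A*y with y ~ N(0, s2*I) and A
symmetric is distributed as s2 times a weighted sum of independent
chi-squared(1) variables, the weights being the eigenvalues of A; it is an
ordinary chi-squared only when those eigenvalues are all 0 or 1.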
David Jones
>On Oct 23, 11:16 am, Greg Heath <he...@alumni.brown.edu> wrote:
>> On Oct 22, 11:13 pm, Greg Heath <he...@alumni.brown.edu> wrote:
[snip]
>> The extension to multivariate linear regression
>> would be ... DOF = N*m - m*(p+1)
>>
>> Am I way off base?
>
>Not sure why I didn't search here first, but
>http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
>is along the line of what I was thinking.
>
That's a good article.
It does not say anything explicit about the "multivariate linear regression"
tests, where there are multiple x_i and y_j, but those tests mostly
fall under "effective degrees of freedom", so far as I know.
- The problem is MANOVA, which is canonical correlation, with
multiple roots.
Thus, you have a multiple-testing problem, inherently, until
you specify how you want to weight the roots. And you
have d.f. available in terms of the sample size, which are
'eaten up' effectively by d.f. for x_i and y_j. Like you suggest,
above, the d.f. are used (somewhat) as a product of the two
counts.
What I remember from reading Rao on these models is that there
are exact distributions for the tests in a few circumstances, such
as, d.f.'s of 1 on one side or 2 on both. Beyond that, he described
tests with distributions that were approximated by F, with
approximate d.f.
That was old, but I'm not aware of newer tests for MANOVA.
I don't know at all what is done for multivariate *nonlinear*
regression.
--
Rich Ulrich
My conclusion is that DOF is not something appropriate to define for non-linear
modelling. The LM/GLM and xANOVA families use DOF both as a relationship
between the data and the model structure and as a parameter of the probability
distributions used for inference (to give confidence intervals, etc., for predictions,
under the homoscedasticity assumption).
Non-linear models simply don't work this way. This is why you can have lots
of parameters and get reasonable predictions. An LM or GLM with lots of
parameters is usually a dog. NNs with lots of weights can be quite respectable.
In some cases, looking at the Hessian of the error function helps (eg, see
optimal brain damage).
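A minimal sketch of the Optimal Brain Damage idea in Python/NumPy (the tiny
hand-rolled 1-5-1 network, the crude finite-difference training and the
finite-difference Hessian diagonal are all just to keep the sketch
self-contained): each weight's saliency is approximately 0.5 * H_ii * w_i^2,
evaluated near a minimum of the error, and the least salient weights are the
pruning candidates.

import numpy as np

rng = np.random.default_rng(0)

# Toy data and a tiny 1-5-1 tanh network, all weights packed into one vector
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * X[:, 0])
w = rng.normal(scale=0.5, size=16)          # W1(5), b1(5), w2(5), b2(1)

def loss(w):
    W1, b1, w2, b2 = w[:5], w[5:10], w[10:15], w[15]
    pred = np.tanh(X * W1 + b1) @ w2 + b2   # X is (200,1); broadcasts to (200,5)
    return 0.5 * np.mean((pred - y) ** 2)

eps = 1e-4

def num_grad(w):
    g = np.empty_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

# Crude gradient descent so the saliencies are evaluated near a minimum
for _ in range(3000):
    w -= 0.2 * num_grad(w)

# Finite-difference diagonal of the Hessian of the loss w.r.t. the weights
h_diag = np.array([(loss(w + eps * np.eye(16)[i]) - 2 * loss(w)
                    + loss(w - eps * np.eye(16)[i])) / eps**2 for i in range(16)])

saliency = 0.5 * h_diag * w**2              # OBD saliency per weight
print("pruning order (least salient first):", np.argsort(saliency))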
Meno.
Errr... no, that is clearly not the case: regularisation is very
effective in training LMs and GLMs with lots of parameters, even in
situations with very little data. See, for example, the results of the
recent IJCNN challenges, where a linear LS-SVM (equivalent to ridge
regression) worked well on many datasets, or try simple ridge
regression on micro-array datasets with thousands of features and
maybe a hundred training patterns; it works just fine in terms of
predictive ability.
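Ridge with p >> N is also a nice place to see why counting raw parameters
misleads: the effective degrees of freedom, trace(X*(X'X + lambda*I)^(-1)*X')
= sum_j d_j^2/(d_j^2 + lambda) with d_j the singular values of X, is far below
the nominal parameter count. A pure-NumPy sketch (the "micro-array"-like shape,
lambda and the sparse true signal are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 2000, 1000.0        # p >> N, heavily regularised

X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:10] = 1.0  # only a few features actually matter
y = X @ beta + 0.5 * rng.normal(size=N)

# Ridge fit via the dual (kernel) form, which is cheap when p >> N
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
beta_hat = X.T @ alpha

# Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lambda)
d = np.linalg.svd(X, compute_uv=False)
edf = np.sum(d**2 / (d**2 + lam))

print("nominal number of parameters:", p)
print("effective degrees of freedom:", round(edf, 1))   # far below p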
If you regularise, it's not a standard GLM/LM, and doesn't have "thousands" of
parameters. SVMs are elegant ways to avoid the actual problem with high dim
input spaces.
Meno.
over-parameterised neural networks don't give good predictions either
unless you take some steps to limit complexity by early stopping or
regularisation or training with noise etc. There is no fundamental
difference between linear and non-linear models in this respect.
There is. The number of parameters in a good NN is higher.
In any case, when I use NNs, I usually use pruning methods.
Meno.
Nonsense; that this is not so can be demonstrated by considering an NN
with skip-layer connections (a very good idea if you use pruning), in which
case the optimal number of weights for the NN and the linear model is the same for
linear problems. For non-linear problems a comparison of the number
of weights is meaningless as one of the models is evidently mis-
specified.
> In any case, when I use NNs, I usually use pruning methods.
In which case (a) the network isn't over-parameterised and (b) you
have taken additional steps to limit complexity.
BTW, as it happens I also tend to use pruning methods as well, in my
case using a Laplace prior so I get regularisation as well.
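For the linear analogue, an L1 penalty (the MAP estimate under a Laplace
prior) prunes by driving many coefficients exactly to zero; a minimal sketch,
assuming scikit-learn is available (sizes, penalty strength and the sparse
truth are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 50
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:5] = 2.0    # only five features are truly active
y = X @ beta + 0.5 * rng.normal(size=N)

# L1-penalised least squares == MAP under a Laplace prior on the coefficients
model = Lasso(alpha=0.1).fit(X, y)

kept = np.flatnonzero(model.coef_ != 0)
print("coefficients kept (non-zero):", kept.size, "of", p)
print("indices of the kept coefficients:", kept)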
Selectivity in argument is the sin of a hair splitter with nothing of
substance, and rampant self-deception to boot.
>> In any case, when I use NNs, I usually use pruning methods.
>
> In which case (a) the network isn't over-parameterised and (b) you
> have taken additional steps to limit complexity.
That is rather obvious I would say. With traditional GLMs, variable
selection is well understood. Pruning an NN is a bit more challenging.
> BTW, as it happens I also tend to use pruning methods as well, in my
> case using a Laplace prior so I get regularisation as well.
Good!
Meno.
oh dear, back to the old meno again :-(
No doubt, I'll send the invoice in the morning!
Ultimately, if you play cheeky with the truth, you end up paying for your folly.
OTOH, some of the issues you raised originally are subtle and complex, so
I'm not going to play a game of pretense and experience (on my part). There
may be some useful outcomes.
Meno.
There is also an earlier (2004) reference from the same site.
The full cites for the three are
Ingrassia S. & Morlini I. (2004), On the degrees of
freedom in richly parameterised models, in "J. Antoch
(Ed.), Proceedings of COMPSTAT 2004 Symposium",
Physica-Verlag, 1237-1244.
http://www.economia.unict.it/ingrassia/publ/IngrMorl04.pdf
Ingrassia S. & Morlini I. (2005), Modeling neural
networks from small datasets, Technometrics, 47(3),
297-311.
http://www.economia.unict.it/ingrassia/publ/Tech2005v47n3p297-311.pdf
Ingrassia S. & Morlini I. (2007), Equivalent number
of degrees of freedom for neural networks, in "Decker
R., Lenz H.-J. (Eds.), Advances in Data Analysis",
Springer-Verlag, Berlin, 229-236.
http://www.economia.unict.it/ingrassia/publ/IngrassiaMorliniGfkl06.pdf
Hope this helps.
Greg