First, I have read your robust list of what to do.
There are a few procedures that I have questions about:
__________________
YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
a. undesirable low corr between y and col of x.
b. undesirable high corr between col of x.
Either one is not too bad a thing to do by hand, but doing both "by eye" is a bit much. I'm working with 186 inputs that I need to cull down.
NOTE: I added 43 World Stock Indices and a unit Lag and many have to go.
Sounds like MATLAB has STEPWISE and STEPWISEFIT to automate this. I was thinking of a selection algorithm but really don't know enough of the mathematics to make a trade between a. and b. when it occurs.
Any suggestions how to wade through this expeditiously?
_________________________
Apparent popularity of using ACF and PACF to determine LAGS for a Forecasting Prediction.
One paper said:
1) use L1...Ln which have significant PACF's, AND 2) Use Li of the 4 top ACF's.
That is clear enough for predicting Price from Price input, however, I have some other issues to deal with:
ISSUE ONE:
I have many other non-price inputs for which I include LAGS. Since they are not Price, they should be correlated with the Output YET STILL uncorrelated with each other. So should a candidate LAG be REQUIRED to satisfy both constraints? Or, if that is not possible, should the least offensive ones be used, or should NO lags be used since none were suitable?
Well, that's confusing enough.
ISSUE TWO:
While I am doing forecasting, I'm also doing pattern recognition - my "stock chart." Although these columns look like Price and dozens of lags, it's really a set of Price Pixels.
So I get the impression that all this ACF, PACF, CORR stuff does not have much to do with pixel input.
NOTE: I was thinking if you were trying to predict the trajectory of a single black pixel image, you might look at a particular pixel output being determined by only a local neighborhood of pixels. Perhaps this would be a "PACF" style simplification.
Nevertheless, I'm not sure what to do about my stock chart columns since they really violate the Price LAG method based on PACF and ACF.
_____________________
I'm really sold on PCA (Principal Component Analysis), but really Discriminant Analysis (also or now called PLS, Partial Least Squares?), due to the Parallel Cigar problem example.
In Modal Analysis of mega-large finite element models, we always did Generalized Dynamic Reduction which found the first n requested Generalized Coordinates upon which we would then extract our eigenvalues.
What could be simpler than reducing the inputs into their Optimum Classifying Coordinates and simply pick the first N terms for column inputs.
Is it true that you could load up the inputs with everything including the kitchen sink and simply let DA extract the Principal CS? If you had poorly separated inputs and poor corr w/outputs, wouldn't those "garbage" columns simply not contribute to the Prin CS determined by DA?
If yes, this would really make things easy.
NOTE: It really makes modal analysis of a simple structure quite easy. However, with non-uniform, complex structures, maybe something like an airframe, you may be interested in eigenvalues which are determined by local structure. What is worse is that when you look at all the modes of the complete structure, these local, "fundamental" modes could be mode number 77, 132, 250, 537, and 1200. Clearly generalized dynamic reduction would require a tremendous number of DOF and would not be as efficient as hand selecting a set of nodes in the local area that would capture the desired mode shapes.
I would guess THAT could be a phenomenon that might appear in neural network input reduction.
What do you think?
____________________
Anyhow, I guess I'm "data mining" trying to find anything that is "different" from what I am currently using to give the net some "wisdom."
YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
a. undesirable low corr between y and col of x.
b. undesirable high corr between col of x.
Results of manual selection experiments:
A) Eliminate Cxx (in-in corr) > .8
This eliminated about 49 of 186 inputs.
Then I looked at
B) Eliminate "zero-ish" Cxy (in-out corr) = [-.1,.1]
This eliminated about 1/2 the remaining inputs, however, a significant fraction of the A types were quite high with Cxy around .2
So I looked at Cxy intervals of ±.05, ±.02, and ±.01
Turned out .05 eliminated another dozen or so with little "contradiction" with the original Cxx A set.
So the question is "What is a stalemate?" Is Cxx = .9 just as "bad" as Cxy = .1 ???
To be specific,
1) Do I keep?
Cxx = .9 and Cxy = .1 ?
Cxx = .8 and Cxy = .1 ?
Cxx = .7 and Cxy = .1 ?
Cxx = .9 and Cxy = .2 ?
Cxx = .8 and Cxy = .2 ?
Cxx = .7 and Cxy = .2 ? or
2) Is there a Cxx sufficiently high that it should be eliminated no matter how high Cxy? or a Cxy sufficiently high that it should be kept no matter how high Cxx?
I don't know.
Obviously, I could create rules based on 1 or 2 ratio parameters and test the model, however, that is a lot of work to spend on what might turn out to be completely wasted effort. I would like to avoid writing code to perform the input selection.
HOWEVER, maybe the way to look at this is to view inputs as DOF and simply choose the number of inputs that you believe will characterize the problem.
Example: Use my "best" 100 inputs.
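For what it's worth, the A) and B) culls described above could be automated along the following lines. This is only a sketch under stated assumptions: the tie-break (of two inputs whose mutual |Cxx| exceeds the threshold, keep the one with the larger |Cxy|) is an illustrative choice, not something established in this thread, and `pearson`, `cull_inputs`, `cxx_max` and `cxy_min` are made-up names. It uses plain Python because MATLAB is not available here.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def cull_inputs(columns, y, cxx_max=0.8, cxy_min=0.05):
    """Return the indices of the input columns that survive both culls."""
    cxy = [abs(pearson(col, y)) for col in columns]
    # Cull B: drop inputs whose in-out correlation is "zero-ish".
    keep = [i for i in range(len(columns)) if cxy[i] >= cxy_min]
    # Cull A: visit survivors from strongest to weakest |Cxy| and keep an
    # input only if it is not too correlated with one already kept.
    # (Assumed tie-break: the input with the larger |Cxy| wins.)
    keep.sort(key=lambda i: cxy[i], reverse=True)
    chosen = []
    for i in keep:
        if all(abs(pearson(columns[i], columns[j])) <= cxx_max for j in chosen):
            chosen.append(i)
    return sorted(chosen)

# Toy data: x1 is nearly a copy of x0, x2 is uncorrelated with y.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9]
x2 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
y = [2.0 * v + 1.0 for v in x0]

survivors = cull_inputs([x0, x1, x2], y)
print(survivors)
```

On the toy data only x0 survives: x2 fails the Cxy test and x1 loses the Cxx tie-break to x0. Whether that tie-break is the right trade between a. and b. is exactly the open question, so the thresholds are left as parameters.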
______________________________
So this is all going on while I'm ignoring my "pixel" pattern recognition input. Obviously only the first few lags of Price will satisfy Cxx and Cxy if at all. So many lags, or pixels, will be massive failures.
>On Oct 5, 7:01 pm, TomH488 <tom...@gmail.com> wrote:
> Greg,
> First, I have read your robust list of what to do.
What list?
> There are a few procedures that I have questions about:
> __________________
> YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
> a. undesirable low corr between y and col of x.
> b. undesirable high corr between col of x.
> Either one is not too bad a thing to do by hand, but doing both "by eye" is a bit much. I'm working with 186 inputs that I need to cull down.
I don't know what you mean "by eye"?
The documentation explains how to use additional confidence/
significance level outputs to quantify the significance of the 186
input-output linear correlations and the 186*185/2 input-input cross
correlations.
help corrcoef
doc corrcoef
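As a rough illustration of what "quantifying significance" can mean without MATLAB, the sketch below attaches an approximate two-sided p-value to a correlation r computed from n samples. It uses the Fisher z-transform with a normal approximation, which is an assumed stand-in and not a reproduction of what corrcoef computes internally; `corr_pvalue` is a made-up name.

```python
import math

def corr_pvalue(r, n):
    """Approximate two-sided p-value for H0: the true correlation is 0.

    Fisher z-transform: atanh(r) * sqrt(n - 3) is roughly standard
    normal under H0 when n is reasonably large.
    """
    if n <= 3 or abs(r) >= 1.0:
        raise ValueError("need n > 3 and |r| < 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variate.
    return math.erfc(abs(z) / math.sqrt(2.0))

for r, n in [(0.1, 28), (0.1, 1000), (0.5, 103)]:
    print(r, n, corr_pvalue(r, n))
```

The point of the example is that the same |r| = 0.1 is indistinguishable from zero with 28 samples but clearly significant with 1000, so a fixed cutoff such as [-.1,.1] only makes sense relative to the number of samples.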
If input cross correlations are significant you might want to
immediately remove some of the "redundant" variables or transform the
inputs to be uncorrelated.
You may also want to check cross correlations of y with 186 quadratic
(xi^2) and some of the 186*185/2 (ugh) interaction (xi*xj) terms.
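A small sketch of why the quadratic and interaction checks can matter, using invented toy data: y depends on x1 only through x1^2, so the linear correlation is zero while the correlation with the squared term is one. The names `pearson`, `lin`, `quad` and `inter` are illustrative only.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

x1 = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
y = [u * u for u in x1]          # y depends on x1 only through x1^2

lin = pearson(x1, y)                                   # linear term xi
quad = pearson([u * u for u in x1], y)                 # quadratic term xi^2
inter = pearson([u * v for u, v in zip(x1, x2)], y)    # interaction xi*xj

print(lin, quad, inter)
```

An input like x1 would be thrown out by a purely linear Cxy cull even though it determines y completely, which is the reason for looking at the xi^2 and xi*xj terms before discarding anything.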
> NOTE: I added 43 World Stock Indices and a unit Lag and many have to go.
> Sounds like MATLAB has STEPWISE and STEPWISEFIT to automate this. I was thinking of a selection algorithm but really don't know enough of the mathematics to make a trade between a. and b. when it occurs.
> Any suggestions how to wade through this expeditiously?
No guarantees.
STEPWISEFIT and STEPWISE (GUI version) are useful for models that are
linear in the coefficients (LIC: e.g., polynomials). The chosen
inputs are good, but not necessarily optimal for these models, much
less the nonlinear NN models.
Nevertheless, I usually start by comparing backward and forward
results for a linear in variable (LIV) model and, I might do the same
for a linear in interactions (xi*xj) or a pure quadratic (xi^2) model.
> _________________________
> Apparent popularity of using ACF and PACF to determine LAGS for a Forecasting Prediction.
ACF = Autocorrelation function?
PACF = ??
> One paper said:
> 1) use L1...Ln which have significant PACF's, AND
> 2) Use Li of the 4 top ACF's.
> That is clear enough for predicting Price from Price input, however, I have some other issues to deal with:
You are trying to predict price? What do you mean by "price input"?
> ISSUE ONE:
> I have many other non-price inputs that I include LAGS. Since they are not Price, they should be not-uncorrelated with Output YET STILL not-correlated to each other. So should a candidate LAG be REQUIRED to satisfy both constraints? or if not possible, should the least offensive ones be used or should NO lags be used since none were suitable?
> Well, that's confusing enough.
Yes, it is. Significant lags can be determined from auto and
crosscorrelation functions.
> ISSUE TWO:
> While I am doing forecasting, I'm also doing pattern recognition - my "stock chart." Although these columns look like Price and dozens of lags, it's really a set of Price Pixels.
You lost me.
> So I get the impression that all this ACF, PACF, CORR stuff does not have much to do with pixel input.
I associate pixels with images. You lost me again.
> NOTE: I was thinking if you were trying to predict the trajectory of a single black pixel image, you might look at a particular pixel output being determined by only a local neighborhood of pixels. Perhaps this would be a "PACF" style simplification.
> Nevertheless, I'm not sure what to do about my stock chart columns since they really violate the Price LAG method based on PACF and ACF.
> _____________________
> I'm really sold on PCA (Principal Component Analysis), but really Discriminant Analysis (also or now called PLS, Partial Least Squares?), due to the Parallel Cigar problem example.
> In Modal Analysis of mega-large finite element models, we always did Generalized Dynamic Reduction which found the first n requested Generalized Coordinates upon which we would then extract our eigenvalues.
> What could be simpler than reducing the inputs into their Optimum Classifying Coordinates and simply pick the first N terms for column inputs.
> Is it true that you could load up the inputs with everything including the kitchen sink and simply let DA extract the Principal CS? If you had poorly separated inputs and poor corr w/outputs, wouldn't those "garbage" columns simply not contribute to the Prin CS determined by DA?
> If yes, this would really make things easy.
> NOTE: It really makes modal analysis of a simple structure quite easy. However, with non-uniform, complex structures, maybe something like an airframe, you may be interested in eigenvalues which are determined by local structure. What is worse is that when you look at all the modes of the complete structure, these local, "fundamental" modes could be mode number 77, 132, 250, 537, and 1200. Clearly generalized dynamic reduction would require a tremendous number of DOF and would not be as efficient as hand selecting a set of nodes in the local area that would capture the desired mode shapes.
> I would guess THAT could be a phenomenon that might appear in neural network input reduction.
> What do you think?
> ____________________
> Anyhow, I guess I'm "data mining" trying to find anything that is "different" from what I am currently using to give the net some "wisdom."
You lost me on modal analysis as well as mixing classification with
timeseries prediction.
> YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
> a. undesirable low corr between y and col of x.
> b. undesirable high corr between col of x.
> Results of manual selection experiments:
> A) Eliminate Cxx (in-in corr) > .8
> This eliminated about 49 of 186 inputs.
> Then I looked at
> B) Eliminate "zero-ish" Cxy (in-out corr) = [-.1,.1]
> This eliminated about 1/2 the remaining inputs, however, a significant fraction of the A types were quite high with Cxy around .2
> So I looked at Cxy intervals of ±.05, ±.02, and ±.01
> Turned out .05 eliminated another dozen or so with little "contradiction" with the original Cxx A set.
> So the question is "What is a stalemate?" Is Cxx = .9 just as "bad" as Cxy = .1 ???
> To be specific,
> 1) Do I keep?
> Cxx = .9 and Cxy = .1 ?
> Cxx = .8 and Cxy = .1 ?
> Cxx = .7 and Cxy = .1 ?
> Cxx = .9 and Cxy = .2 ?
> Cxx = .8 and Cxy = .2 ?
> Cxx = .7 and Cxy = .2 ? or
> 2) Is there a Cxx sufficiently high that it should be eliminated no matter how high Cxy? or a Cxy sufficiently high that it should be kept no matter how high Cxx?
> I don't know.
I haven't used this technique more than a few times because of
stepwisefit. However, variables for which abs(Cyxi) > max(j~=i)
{abs(Cxjxi)} would certainly catch my attention.
> Obviously, I could create rules based on 1 or 2 ratio parameters and test the model, however, that is a lot of work to spend on what might turn out to be completely wasted effort. I would like to avoid writing code to perform the input selection.
> HOWEVER, maybe the way to look at this is to view inputs as DOF and simply choose the number of inputs that you believe will characterize the problem.
> Example: Use my "best" 100 inputs.
Do you mean the xi w.r.t. the top M abs(Cyxi)
where M < rank(X) ?
Why not the xi from the top M
abs(Cyxi) - max(j~=i){abs(Cxjxi)} ?
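A sketch of that ranking in plain Python, reading `~=` as MATLAB's "not equal" operator, so that max(j~=i) runs over every other input j, not over inputs "near the diagonal". The toy data and the names `pearson` and `rank_inputs` are invented for illustration.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def rank_inputs(columns, y, m):
    """Top m (score, index) pairs, where
    score_i = abs(Cyxi) - max over j != i of abs(Cxjxi)."""
    n = len(columns)
    scores = []
    for i in range(n):
        cyx = abs(pearson(columns[i], y))
        worst = max(abs(pearson(columns[i], columns[j]))
                    for j in range(n) if j != i)
        scores.append((cyx - worst, i))
    scores.sort(reverse=True)
    return scores[:m]

# Toy data: x1 nearly duplicates x0; y is driven mostly by x2.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9]
x2 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
y = [3.0 * b + 0.1 * a for a, b in zip(x0, x2)]

top = rank_inputs([x0, x1, x2], y, 1)
print(top)
```

A positive score corresponds to the abs(Cyxi) > max(j~=i){abs(Cxjxi)} condition mentioned earlier; here only x2 achieves it, because x0 and x1 are each penalized for their near-duplicate.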
______________________________
> So this is all going on while I'm ignoring my "pixel" pattern recognition input. Obviously only the first few lags of Price will satisfy Cxx and Cxy if at all. So many lags, or pixels, will be massive failures.
> Perhaps all this is ignored for pixel inputs?
You are dealing with a nonlinear problem. Using a series of linear
model tools may, at best, reveal obviously bad and obviously good
input features.
1. Standardize all variables (zero-mean/unit-variance)
using zscore or mapstd.
2. Estimate the practical dimensionality of the variable spaces via
the rank and condition number of X and Z
(if the ranks are the same, a linear model is sufficient).
3. Transform X to the eigenspace of the correlation coefficient matrix
(same as cov(X) since X is standardized)
4. With a lot of input candidates, you could begin by transforming to
the subspace of full rank eigenvectors and, if necessary, consider the
original variables later.
5. Use stepwisefit on the transformed inputs.
6. It is always interesting to compare abs(Cyxi) with max(j ~= i)
{ abs(Cxjxi) } for chosen and unchosen transformed variables.
7. For additional info, you might want to use stepwisefit on a model
of squared and/or interaction (xj*xi) terms of the unchosen variables.
8. Design a static NN using y and/or y-yhat where yhat
is the result of stepwisefit.
9. Use backward search to choose the best variables.
=========================================
10. Use crosscorrelation functions to estimate which
lagged input variables to consider.
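Steps 1 and 3 can be sketched without MATLAB for the simplest case of two inputs. After standardizing, the covariance of the z-scores equals the correlation coefficient, and for a 2x2 correlation matrix [[1, r], [r, 1]] the eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2) with eigenvalues 1+r and 1-r. The data and the helper names `zscore`, `cov` and `corr` below are invented for illustration; with 186 inputs a numerical eigen-solver would be needed instead of this closed form.

```python
import math

def zscore(col):
    """Standardize to zero mean and unit (sample) variance."""
    n = len(col)
    mean = sum(col) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / std for v in col]

def cov(a, b):
    """Sample covariance (divides by n - 1)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)

def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

x1 = [10.0, 12.0, 9.0, 15.0, 11.0]
x2 = [200.0, 260.0, 190.0, 300.0, 240.0]

# Step 1: standardize.
z1, z2 = zscore(x1), zscore(x2)
r = corr(x1, x2)

# Step 3 (two-input special case): project onto the eigenvectors
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2) of [[1, r], [r, 1]].
s = math.sqrt(2.0)
p1 = [(a + b) / s for a, b in zip(z1, z2)]
p2 = [(a - b) / s for a, b in zip(z1, z2)]

print(r, cov(z1, z2))                          # equal: cov of z-scores is corr
print(cov(p1, p2), cov(p1, p1), cov(p2, p2))   # ~0, 1 + r, 1 - r
```

The transformed inputs p1 and p2 are uncorrelated, and when r is close to 1 the second one carries variance 1-r, close to zero, which is the sense in which a highly correlated pair contains only about one input's worth of information.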
> Newsgroups: comp.ai.neural-nets
> From: TomH488 <tom...@gmail.com>
> Date: Mon, 8 Oct 2012 07:12:57 -0700 (PDT)
> Local: Mon, Oct 8 2012 10:12 am
> Subject: Re: Input Pre-Processing
10. Use the autocorrelation and partial autocorrelation functions to
determine which output feedback delays to consider.
11. Use crosscorrelation and partial crosscorrelation functions to
estimate which lagged input variables to consider.
P.S. I just learned from you about the existence and usefulness of
partial correlations for determining significant lags. Now I need to
look into it further.
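For readers without MATLAB, a minimal sketch of the two functions named in item 10: the sample autocorrelation, and the partial autocorrelation obtained from it by the Durbin-Levinson recursion. The function names `acf` and `pacf_from_acf` are made up, and the 1.96/sqrt(N) band in the last lines is the usual large-sample rule of thumb for "significant", taken here as an assumption.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf_from_acf(rho):
    """Partial autocorrelations for lags 1..len(rho)-1 (rho[0] must be 1),
    computed with the Durbin-Levinson recursion."""
    pacf = []
    prev = []  # prev[j] holds phi_{k-1, j+1}, the AR(k-1) coefficients
    for k in range(1, len(rho)):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        prev.append(phi_kk)
        pacf.append(phi_kk)
    return pacf

# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5**k.
ar1_pacf = pacf_from_acf([1.0, 0.5, 0.25, 0.125])
print(ar1_pacf)

# On real data: compute the sample ACF, then the PACF, then compare each
# value with the rough 95% band +/- 1.96 / sqrt(N).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = acf(prices, 1)
band = 1.96 / math.sqrt(len(prices))
print(rho, band)
```

The AR(1) case shows what the PACF adds: the ACF decays slowly over many lags, while the PACF is nonzero only at lag 1, which is the lag actually worth feeding back.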
Thanks Greg for the thorough replies - I'll have to digest a lot of what you have written.
Remember, I do not have MATLAB so while I mention some of their functions, I only do so in trying to understand how some of these evaluations can be automated.
As far as culling by C, AC, and PAC, while I can calculate those numbers, the selection criterion is completely qualitative. So I call that "picking them by eye" and always use some "Kentucky windage!"
I am very happy to learn that maybe I have made you aware of something (PACF's) that you will find useful !!!
> > There are a few procedures that I have questions about:
> > __________________
> > YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
> > a. undesirable low corr between y and col of x.
> > b. undesirable high corr between col of x.
> > Either one is not too bad a thing to do by hand, but doing both "by eye" is a bit much. I'm working with 186 inputs that I need to cull down.
> I don't know what you mean "by eye"?
When you have to make a compromise selection such as not so bad Cxx yet not so good Cxy, do you keep or cull? Besides, I have no idea how to make that decision quantitatively anyhow - my guess is that it depends on the model, which makes it empirical. Or can the decision be calculated?
> The documentation explains how to use additional confidence/
> significance level outputs to quantify the significance of the 186
> input-output linear correlations and the 186*185/2 input-input cross
> correlations.
First, just so I know we're talking about the same thing, I call
Cxx the input-input cross corr which are probably bad if near 1.0
Cxy the in-out linear corr which are of no value if near 0.0
Are you now referring to "Steps 7 & 8"
7) Test the strength of combined linear I/O correlations by calculating the mean square errors, and
8) If I/O correlations are sufficiently high, the training goal MSE = 0.01*MSE0 is reasonable
I don't exactly understand the notation, but I get the idea we're talking about training the network with a single input column and a simplified y (mean, linear regression line, ???) ??? I really have no idea here, especially how to interpret the results.
Again, NO hablo MATLAB
> help corrcoef
> doc corrcoef
> If input cross correlations are significant you might want to
> immediately remove some of the "redundant" variables or transform the
> inputs to be uncorrelated.
My "independent" culls (based on Cxx only or Cxy only) consist of omitting Cxx above 0.8, and Cxy within .05 of zero. The philosophy is that these are so bad, it doesn't matter how good the other C.. would have been.
Again, what to do with the ones that are not so bad.
Would PCA or better yet, Discriminant Analysis (or PLS, Partial Least Squares) simply remove all concern about culling inputs? My experience from structural modal analysis says yes, a simple way to deal with input size reduction.
> You may also want to check cross correlations of y with 186 quadratic
> (xi^2) and some of the 186*185/2 (ugh) interaction (xi*xj) terms.
Don't know how to do this. But PLS is sounding like a simple solution for "manual culling based on Cxx & Cxy correlations."
> > NOTE: I added 43 World Stock Indices and a unit Lag and many have to go.
> > Sounds like MATLAB has STEPWISE and STEPWISEFIT to automate this. I was thinking of a selection algorithm but really don't know enough of the mathematics to make a trade between a. and b. when it occurs.
> > Any suggestions how to wade through this expeditiously?
> No guarantees.
> STEPWISEFIT and STEPWISE(Gui version) are useful for models that are
> linear in the coefficients ( LIC: e.g., polynomials). The chosen
> inputs are good, but not necessarily optimal for these models, much
> less the nonlinear NN models.
> Nevertheless, I usually start by comparing backward and forward
> results for a linear in variable (LIV) model and, I might do the same
> > I have many other non-price inputs that I include LAGS. Since they are not Price, they should be not-uncorrelated with Output YET STILL not-correlated to each other. So should a candidate LAG be REQUIRED to satisfy both constraints? or if not possible, should the least offensive ones be used or should NO lags be used since none were suitable?
> > Well, that's confusing enough.
> Yes it is. significant lags can be determined from auto and
> crosscorrelation functions.
I've never found any methodology of picking lags this way in the FAQs. Did I miss something?
Also, seems like we have 4 correlations in this discussion:
1) Correlation (generic term? needs adjective to be specific?)
2) Cross Correlation (these are the Cxx & Cxy?)
3) Auto Correlation (know this one)
4) Partial Auto Correlation (know this one)
Humm, if Cross Corr can help determine lags, then I don't know this method. The only one I have found is the ACF and PACF method.
Need Example (I work best from Examples - the more graphical the better. I am worst at working with mathematical equations - don't show me the matrix equations for a particular Nnet architecture, draw me the graph!)
> > ISSUE TWO:
> > While I am doing forecasting, I'm also doing pattern recognition - my "stock chart." Although these columns look like Price and dozens of lags, it's really a set of Price Pixels.
> You lost me.
A "stock Chart" is an "image."
In pure pattern recognition, you have a video camera which takes a picture of a part a robot wants to pick up or a target that needs to be identified before it is shot at.
This is all pixel input. Each pixel is a column.
When you think of culling pixels, they would probably be "peripheral" ones.
What is even more interesting is, is there a way to define the input so that it is known which pixels are adjacent? That is, rather than have data from a random list of pixels, to have data from a x,y mesh of pixels. Or does it even matter.
> > So I get the impression that all this ACF, PACF, CORR stuff does not have much to do with pixel input.
> I associate pixels with images. You lost me again.
A stock chart is an image too. It can be a (x,y) series too handled by 2 columns which is what a line function can be broken down into when pixels of the chart are pre-processed.
> > NOTE: I was thinking if you were trying to predict the trajectory of a single black pixel image, you might look at a particular pixel output being determined by only a local neighborhood of pixels. Perhaps this would be a "PACF" style simplification.
> > Nevertheless, I'm not sure what to do about my stock chart columns since they really violate the Price LAG method based on PACF and ACF.
> > _____________________
> > I'm really sold on PCA (Principal Component Analysis), but really Discriminant Analysis (also or now called PLS, Partial Least Squares?), due to the Parallel Cigar problem example.
> > In Modal Analysis of mega-large finite element models, we always did Generalized Dynamic Reduction which found the first n requested Generalized Coordinates upon which we would then extract our eigenvalues.
> > What could be simpler than reducing the inputs into their Optimum Classifying Coordinates and simply pick the first N terms for column inputs.
> > Is it true that you could load up the inputs with everything including the kitchen sink and simply let DA extract the Principal CS? If you had poorly separated inputs and poor corr w/outputs, wouldn't those "garbage" columns simply not contribute to the Prin CS determined by DA?
> > If yes, this would really make things easy.
> > NOTE: It really makes modal analysis of a simple structure quite easy. However, with non-uniform, complex structures, maybe something like an airframe, you may be interested in eigenvalues which are determined by local structure. What is worse is that when you look at all the modes of the complete structure, these local, "fundamental" modes could be mode number 77, 132, 250, 537, and 1200. Clearly generalized dynamic reduction would require a tremendous number of DOF and would not be as efficient as hand selecting a set of nodes in the local area that would capture the desired mode shapes.
> > I would guess THAT could be a phenomenon that might appear in neural network input reduction.
> > What do you think?
> > ____________________
> > Anyhow, I guess I'm "data mining" trying to find anything that is "different" from what I am currently using to give the net some "wisdom."
> You lost me on modal analysis as well as mixing classification with
> On Oct 8, 10:12 am, TomH488 <tom...@gmail.com> wrote:
> > More thinking on:
> > YOUR ITEM #4: Use the data matrix correlation coefficient matrix CC=CORRCOEF(Z) to identify:
> > a. undesirable low corr between y and col of x.
> > b. undesirable high corr between col of x.
> > Results of manual selection experiments:
> > A) Eliminate Cxx (in-in corr) > .8
> > This eliminated about 49 of 186 inputs.
> > Then I looked at
> > B) Eliminate "zero-ish" Cxy (in-out corr) = [-.1,.1]
> > This eliminated about 1/2 the remaining inputs, however, a significant fraction of the A types were quite high with Cxy around .2
> > So I looked at Cxy intervals of ±.05, ±.02, and ±.01
> > Turned out .05 eliminated another dozen or so with little "contradiction" with the original Cxx A set.
> > So the question is "What is a stalemate?" Is Cxx = .9 just as "bad" as Cxy = .1 ???
> > To be specific,
> > 1) Do I keep?
> > Cxx = .9 and Cxy = .1 ?
> > Cxx = .8 and Cxy = .1 ?
> > Cxx = .7 and Cxy = .1 ?
> > Cxx = .9 and Cxy = .2 ?
> > Cxx = .8 and Cxy = .2 ?
> > Cxx = .7 and Cxy = .2 ? or
> > 2) Is there a Cxx sufficiently high that it should be eliminated no matter how high Cxy? or a Cxy sufficiently high that it should be kept no matter how high Cxx?
> > I don't know.
> I haven't used this technique more than a few times because of
> stepwisefit. However, variables for which abs(Cyxi) > max(j~=i)
> {abs(Cxjxi)} would certainly catch my attention.
I'm not quite sure I understand your notation.
First of all, are not the xi and xj implied in Cxx?
Also, abs(Cxy) is always implied when I use it.
Now, Cxy > Cxx (want Cxy near 1 & Cxx near 0) sounds like you would keep these?
However, you say j~=i or approximately j=i or when we are looking at Cxx that are NEAR THE DIAGONAL??? If that is correct, please explain - I don't know why near the diagonal is important.
> > Obviously, I could create rules based on 1 or 2 ratio parameters and test the model, however, that is a lot of work to spend on what might turn out to be completely wasted effort. I would like to avoid writing code to perform the input selection.
> > HOWEVER, maybe the way to look at this is to view inputs as DOF and simply choose the number of inputs that you believe will characterize the problem.
> > Example: Use my "best" 100 inputs.
UNFORTUNATELY, starting from here to ...
> Do you mean the xi w.r.t. the top M abs(Cyxi)
> where M < rank(X) ?
> Why not the xi from the top M
> abs(Cyxi) - max(j~=i){abs(Cxjxi)} ?
...here, I don't know what M or "top M" is or how it relates to the Rank (x).
> > So this is all going on while I'm ignoring my "pixel" pattern recognition input. Obviously only the first few lags of Price will satisfy Cxx and Cxy if at all. So many lags, or pixels, will be massive failures.
> > Perhaps all this is ignored for pixel inputs?
> You are dealing with a nonlinear problem. Using a series of linear
> model tools may, at best, reveal obviously bad and obviously good
> input features.
Keep in mind, I don't have MATLAB, so I'll comment on the following steps:
> 1. Standardize all variables (zero-mean/unit-variance)
> using zscore or mapstd.
> 2. Estimate the practical dimensionality of the variable spaces via
> the rank and condition number of X and Z
What is Z?
> (if the ranks are the same, a linear model is sufficient).
Rank(X) = Rank(Z) ?
Linear Model implies Sigmoid not required?
> 3. Transform X to the eigenspace of the correlation coefficient matrix
> (same as cov(X) since X is standardized)
In other words, just use cov(X)? Have to find some Excel add-ins since no MatLab
> 4. With a lot of input candidates, you could begin by transforming to
> the subspace of full rank eigenvectors and, if necessary, consider the
> original variables later.
First, full rank evects sounds good but why not shoot for orthogonal evects?
Also, how to do this full rank Xform?
> 5. Use stepwisefit on the transformed inputs.
No hablo MatLab (!)
> 6. It is always interesting to compare abs(Cyxi) with max(j ~= i)
> { abs(Cxjxi) } for chosen and unchosen transformed variables.
Again, does j ~= i mean "near the diagonal?" If yes, need some explanation, if no, need clarification.
> 7. For additional info, you might want to use stepwisefit on a model
> of squared and/or interaction (xj*xi) terms of the unchosen variables.
Again, no Matlab, but also beyond my present level of comprehension. Examples?
> 8. Design a static NN using y and/or y-yhat where yhat
> is the result of stepwisefit.
Again, no Matlab. Don't know what to do here either.
> 9. Use backward search to choose the best variables.
Example? (just look backwards through above results?)
> =========================================
> 10. Use crosscorrelation functions to estimate which
> lagged input variables to consider.
I need an example here. More talk of this usage but haven't found an example. My only example was the white paper where the fellow used the ACF and PACF to pick lags for the input.
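A rough idea of what such an example might look like, in plain Python since MATLAB is not available here. It computes the ordinary correlation between a lagged input x[t-k] and the output y[t] for several k and reports the lag with the largest magnitude. The series is invented (y is just x delayed by two steps), `pearson` and `lagged_corr` are made-up names, and this is the plain cross-correlation only, not the partial cross-correlation mentioned elsewhere in the thread.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def lagged_corr(x, y, max_lag):
    """Correlation between x[t-k] and y[t] for k = 0..max_lag."""
    n = len(y)
    return [pearson(x[:n - k], y[k:]) for k in range(max_lag + 1)]

# Invented input series; the output is the input delayed by 2 steps.
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0,
     5.0, 3.0, 5.0, 8.0, 9.0, 7.0, 9.0, 3.0]
y = [0.0, 0.0] + x[:-2]          # y[t] = x[t-2] for t >= 2

cc = lagged_corr(x, y, 4)
best = max(range(len(cc)), key=lambda k: abs(cc[k]))
print(cc)
print("largest |cross-correlation| at lag", best)
```

Plotted against k, the values in `cc` would be the graphical version of this: a bar chart with one tall bar at the lag worth including as an input column.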
> Hope this helps
> Greg
Thanks Greg,
My limitations are widespread but I'm working on their reduction,
Tom
> > > This eliminated about 1/2 the remaining inputs, however, a significant fraction of the A types were quite high with Cxy around .2
> > > So I looked at Cxy intervals of ±.05, ±.02, and ±.01
> > > Turned out .05 eliminated another dozen or so with little "contradiction" with the original Cxx A set.
> > > So the question is "What is a stalemate?" Is Cxx = .9 just as "bad" as Cxy = .1 ???
> > > To be specific,
> > > 1) Do I keep?
> > > Cxx = .9 and Cxy = .1 ?
> > > Cxx = .8 and Cxy = .1 ?
> > > Cxx = .7 and Cxy = .1 ?
> > > Cxx = .9 and Cxy = .2 ?
> > > Cxx = .8 and Cxy = .2 ?
> > > Cxx = .7 and Cxy = .2 ? or
> > > 2) Is there a Cxx sufficiently high that it should be eliminated no matter how high Cxy? or a Cxy sufficiently high that it should be kept no matter how high Cxx?
> > > I don't know.
> > I haven't used this technique more than a few times because of
> > stagewisefit. However, variables for which abs(Cyxi) > max(j~=i)
> > {Cxjxi} would certainly catch my attention.
> > > Obviously, I could create rules based on 1 or 2 ratio parameters and test the model, however, that is a lot of work to spend on what might turn out to be completely wasted effort. I would like to avoid writing code to perform the input selection.
> > > HOWEVER, maybe the way to look at this is to view inputs as DOF and simply choose the number of inputs that you believe will characterize the problem.
> > > Example: Use my "best" 100 inputs.
> > Do you mean the xi w.r.t. the top M abs(Cyxi)
> > where M < rank(X) ?
> > Why not the xi from the top M
> > abs(Cyxi) - max(j~=i){abs(Cxjxi)} ?
> > ______________________________
> > > So this is all going on while I'm ignoring my "pixel" pattern recognition input. Obviously only the first few lags of Price will satisfy Cxx and Cxy if at all. So many lags, or pixels, will be massive failures.
> > > Perhaps all this is ignored for pixel inputs?
> > You are dealing with a nonlinear problem. Using a series of linear
> > model tools may, at best, reveal obviously bad and obviously good
> > input features.
> > 1. Standardize all variables (zero-mean/unit-variance)
> > using zscore or mapstd.
> > 2. Estimate the practical dimensionality of the variable spaces via
> > the rank and condition number of X and Z
> > (if the ranks are the same, a linear model is sufficient).
> > 3. Transform X to the eigenspace of the correlation coefficient matrix
> > (same as cov(X) since X is standardized)
> > 4. With a lot of input candidates, you could begin by transforming to
> > the subspace of full rank eigenvectors and, if necessary, consider the
> > original variables later.
> > 5. Use stagewisefit on the transformed inputs.
> > 6. It is always interesting to compare abs(Cyxi) with max(j ~= i)
> > { abs(Cxjxi) } for chosen and unchosen transformed variables.
> > 7. For additional info, you might want to use stagewisefit on a model
> > of squared and/or interaction (xj*xi) terms of the unchosen variables.
> > 8. Design a static NN using y and/or y-yhat where yhat
> > is the result of stagewisefit.
> > 9. Use backward search to choose the best variables.
> > =========================================
> 10. Use the autocorrelation and partial autocorrelation functions to
> determine which output feedback delays to consider.
OK, got this one I think.
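For reference, a sketch of how the ACF and PACF of the output series can be computed in plain Python. The PACF uses the standard Durbin-Levinson recursion on the sample autocorrelations; the function names are hypothetical. Lags whose PACF value lies outside roughly ±1.96/sqrt(N) are the usual candidates for output feedback delays.

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations for lags 0..max_lag (Durbin-Levinson)."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev = []  # coefficients phi[k-1][1..k-1] from the previous order
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi
    return result
```

The PACF at lag k measures the correlation between y[t] and y[t - k] after the effect of the intermediate lags has been removed, which is why it is preferred over the raw ACF for choosing which delays to keep.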
> 11. Use crosscorrelation and partial crosscorrelation functions to
> estimate which lagged input variables to consider.
OK, I think I see this one, this is for adding lags of INPUTs to the set of input variables. I really need this one since right now, this is what I'm focusing on. Do you have any Examples or references/links?
> P.S. I just learned from you about the existence and usefulness of
> partial correlations for determining significant lags. Now I need to
> look into it further.
> Hope this helps
> Greg
Thanks so much!
Tom
PS. I have seen the first training of a net with Cxy and Cxx culls (culling only the obviously bad, Cxy=0 and Cxx=1), and the result is dramatically improved error reduction per epoch during training. Error reduction per doubling of epochs increased from 1.5:1 to 2.5:1 and, further, it is possible to train to much lower errors when using momentum = .9 to .95 and RANDOM epoch order.
I can only imagine what the training would look like with orthogonal inputs obtained from Discriminant Analysis.