Propose Mixed Cardinal/Real Input Column

TomH488

Aug 24, 2012, 1:53:06 PM

There has been lots of discussion of Cardinal inputs versus real inputs.

However, a common problem is having data which is only recent and does not go back as far as other input data.

How to incorporate it? More specifically, what to do with early missing input values?

I have read that using the average of the existing data for the missing data is one way to deal with it.

I would agree only if one could say that the state of the system was the same for all input values.

On the other hand, if the state is probably different, or even much different, and that state change can be identified, then a simple Cardinal column indicating State A or B would be of use.

Suppose we have data for the latter State B but nothing for the earlier State A?

I propose that State A values be set to ZERO while State B values are the actual data.

The result is that you have useful "new" weights which apply only to State B. The Zeroes assigned for State A simply "turn off" those State B weights (as they should be.)
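
The "turn off" argument can be written out as a short sketch. It assumes an ordinary feed-forward network trained by gradient descent, which is an assumption here since NeuroShell 2's internals are not described in this thread; the symbols (x_i for inputs, w_{ji} for weights, b_j for the bias, E for the error) are illustrative only.

```latex
% Net input to hidden unit j for a single case:
a_j = \sum_i w_{ji}\, x_i + b_j

% If input k is set to zero for a State A case, its term drops out:
x_k = 0 \;\Rightarrow\; w_{jk}\, x_k = 0

% and that case contributes nothing to the gradient of that weight:
\frac{\partial E}{\partial w_{jk}}
  = \frac{\partial E}{\partial a_j}\, x_k = 0
```

Under these assumptions the weights w_{jk} are adjusted only by State B cases, while the bias and all other weights are still trained on every case.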

CAUTION: One must be careful that the ZEROes are applied after all standardization or transformation of the inputs. The last thing you need would be a big block of zeroes ruining the mean and standard deviation of the actual data.

IRONICALLY, when Standardized to Zero Mean, use of the Zeros is actually using the Mean.
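
The ordering in the CAUTION above can be illustrated with a minimal sketch. The function name and the sample column are hypothetical, missing values are represented as `None`, and the population standard deviation is an arbitrary choice; it also assumes the observed values are not all identical.

```python
from statistics import mean, pstdev

def standardize_then_zero_fill(column):
    """Standardize using observed values only, then put 0.0 where data is missing."""
    observed = [v for v in column if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed)  # assumes the observed values are not all equal
    return [0.0 if v is None else (v - mu) / sigma for v in column]

# Hypothetical column: the three early (State A) cases have no data.
raw = [None, None, None, 10.0, 12.0, 14.0, 16.0]
filled = standardize_then_zero_fill(raw)

# For comparison, the wrong order: zero-fill first, then compute the mean.
zero_first = [0.0 if v is None else v for v in raw]
mean_observed = mean(v for v in raw if v is not None)
mean_zero_first = mean(zero_first)

print(filled)
print(mean_observed, mean_zero_first)
```

In this example the zeros inserted after standardization coincide with the mean of the standardized data, whereas zero-filling first pulls the computed mean well below the mean of the actual data.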

Comments?

Thanks
Tom

TomH488

Sep 23, 2012, 8:51:11 AM

anyone?

Greg Heath

Sep 23, 2012, 5:58:17 PM

How many total variables? How many cases?

How many variables have missing early data? How much is missing?

The first thing I would do is see what difference it makes when using all of the recent data vs replacing some of the recent variables with means. If that isn't satisfactory, I wouldn't put much hope in using all of the data with past missing data replaced by means.

Obviously, much depends on how important those variables are and the ratio of recent data size to past data size.

Hope this helps.

Greg

TomH488

Sep 24, 2012, 11:00:39 AM

180 = Total variables
3500 = cases
>50% of the variables missing early data
25 to 75% missing in a variable

I didn't quite understand your "...replacing some of the recent variables with means." Does "recent variables" mean those with 75% missing data? And if yes, which cases of that variable get replaced with means?

I'm using Ward Systems NeuroShell2 which is pretty much a black box. What I am suspicious of is that for a set of 75% missing variables (Middle East Stock Market Indices) I get an extremely high "Contribution Factor" which is not precisely defined.

Another potential warning sign is I get two completely different Contribution Factor plots when using the simple Momentum training versus the black box TurboProp1 which is supplied with the NeuroShell2.

Momentum results in a basically uniform contribution, while TP1 results in 2-3x the max/min ratio seen with Mom. With TP1, the Middle East Indices (75% missing) are very high compared to when Mom is used. Either this is true, or there is some kind of problem with TP1 in the way it calculates contributions. I wouldn't put much money on a bet either way. But it just seems fishy that I can get about the same errors with both methods and get two completely different contribution signatures. Frankly, the highly stratified TP1 is what I would have expected to see rather than the uniform-ish Mom results.

Also, TP1 is not the best - it gets confused easily and was replaced with TP2 which is not available for this particular software.

I'm sort of Standardizing the inputs: I use Zero Mean and scale so that about 3 sigma = 1. I seemed to find execution anomalies when I set StDev = 1, which resulted in values significantly outside (-1,1). NS2 seems to like values within (-1,1) for whatever reason. When data is missing, I'm using zero, which is the mean. So that's where I get this "Cardinal/Variable" input.
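
A rough sketch of the scaling described above, comparing StDev = 1 with 3 sigma = 1. The function name, the parameter `k`, and the sample values are hypothetical; missing cases are `None` and are set to zero after scaling.

```python
from statistics import mean, pstdev

def scale_zero_mean(column, k):
    """Zero mean, scaled so that k standard deviations map to 1; missing -> 0.0."""
    observed = [v for v in column if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed)  # assumes the observed values are not all equal
    return [0.0 if v is None else (v - mu) / (k * sigma) for v in column]

def count_outside_unit_interval(values):
    """Number of values that fall outside the open interval (-1, 1)."""
    return sum(1 for v in values if not -1.0 < v < 1.0)

# Hypothetical index values with two early missing cases.
raw = [None, None, 98.0, 100.0, 101.0, 99.0, 102.0, 100.0]

outside_k1 = count_outside_unit_interval(scale_zero_mean(raw, k=1.0))
outside_k3 = count_outside_unit_interval(scale_zero_mean(raw, k=3.0))
print(outside_k1, outside_k3)
```

With k = 3 only values more than three standard deviations from the mean would land outside (-1,1), so this scaling keeps nearly all inputs in that range without any clipping.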

Thanks so much,
Tom

PS. I believe an issue is letting the network know "what time it is." A continuous time input column doesn't do this in my mind. What happens as time passes is that you progress from one "state" into another. I suspect time would be better served by a separate column for an interval - say 3 months. If you model 3 years, you need 12 columns with a little "unit diagonal" partition so to speak. You would be switching in a unique set of weights for each period to adjust the baseline network for the particular state it is in.

One thing for sure, a monotonic time cannot be effective because State(2) is NOT the average of State(1) and State(3).
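
The interval columns described in this PS could be built along the following lines. This is only a sketch: the function name and the use of a month counter starting at zero are assumptions made for illustration.

```python
def period_one_hot(month_index, months_per_period=3, n_periods=12):
    """Return n_periods 0/1 columns with a single 1 marking the current period.

    month_index counts months elapsed since the start of the modelled span
    (0 for the first month). Defaults give 12 three-month periods for 3 years.
    """
    period = month_index // months_per_period
    if not 0 <= period < n_periods:
        raise ValueError("month_index falls outside the modelled span")
    return [1 if j == period else 0 for j in range(n_periods)]

# One row per month over three years (36 months).
rows = [period_one_hot(m) for m in range(36)]

print(rows[0])   # first quarter
print(rows[4])   # second quarter
print(rows[35])  # last quarter
```

Each row has exactly one active column, so every period gets its own weights and no ordering between State(1), State(2), and State(3) is implied.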

TomH488

Oct 5, 2012, 6:27:37 PM

Just re-read your post and I may not have been clear:

I have 2 options with NeuroShell 2:

1) Replace missing values with the MEAN of the available data.
2) Replace missing values with ZERO.

Of course, if Standardized, the MEAN = ZERO so there is no difference.

Again, my simplistic thinking: The ZERO would simply "turn off" the weights associated with the existing data when that data was missing.

Sort of a [0,1] Cardinal multiplier of a Data Column.

I could try running the model with only complete data, but that would reduce my cases from 3500 to about 800. The model seemed to improve when I increased from 2500 to 3500 cases, so I'm not that optimistic.

However, see my new post. I'm really getting into Input Pre-Processing, which should have been done long ago, but I'm working here in a vacuum.

Thanks
Greg