To make IFCSoft more robust and useful with other types of
data, it should be able to load and handle categorical data and
data points that are missing values:
Missing Data: Issue 9 (http://code.google.com/p/ifcsoft/issues/detail?id=9)
Categorical Data: Issue 10 (http://code.google.com/p/ifcsoft/issues/detail?id=10)
I want to discuss these together since both will change the
same code sections (data sets and SOM algorithm) and perhaps should be
done in conjunction, or at least with knowledge of how the other
problem will be solved.
Categorical Data:
Getting numerical categorical data to work should be relatively
straightforward. I suggest adding to the "Choose SOM Weights" dialog
box a way of setting a channel to be handled categorically. The
default weight for categorical variables should be lower (maybe .25)
or they will dominate the continuous variables. The SOM algorithm
would have to be told which channels are categorical. Then, when
computing Euclidean distance, if the dimension is categorical and the
two values are different, that dimension contributes a "distance"
equal to its weight (and zero if they match).
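
To make the distance rule concrete, here is a rough sketch (in Python for brevity, not IFCSoft's actual code). The function name, the parallel `weights` and `categorical` lists, and the choice to scale continuous differences by their weight are all my assumptions for illustration:

```python
import math

def mixed_distance(a, b, weights, categorical):
    """Weighted Euclidean distance over mixed continuous/categorical dimensions.

    Assumption: a categorical dimension contributes 0 when the two values
    match and its weight when they differ; a continuous dimension
    contributes weight * (difference).
    """
    total = 0.0
    for x, y, w, is_cat in zip(a, b, weights, categorical):
        if is_cat:
            diff = 0.0 if x == y else w   # mismatch costs exactly the weight
        else:
            diff = w * (x - y)            # hypothetical weighting of continuous channels
        total += diff * diff
    return math.sqrt(total)
```

With a categorical weight of .25, a category mismatch adds only 0.0625 to the squared distance, so a continuous difference of 3 still dominates.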
Loading non-numerical categorical data will be harder. In order to not
use too much space for large data sets, we'll probably have to make
numbers to represent the different strings (or we could just use a
single String object for each unique string). We'll also have to have
a way of marking which dimensions are categorical and whether they are
string categories. Combining data sets will be complicated, since we
have to make sure all the categories align. We may want an interface
like the one for column synchronizing, in case category names differ
by capitalization or abbreviation.
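
One possible shape for the string-to-number mapping, again as a Python sketch rather than a proposal for the real classes. `CategoryCodes` and `remap_codes` are hypothetical names, and automatically treating names that differ only by capitalization or surrounding whitespace as the same category is an assumption; abbreviations would still need to be matched by hand:

```python
class CategoryCodes:
    """Maps each unique category string in one column to a small integer code."""

    def __init__(self):
        self.code_of = {}   # normalized name -> integer code
        self.names = []     # integer code -> first spelling seen

    def encode(self, name):
        # Assumption: case and surrounding whitespace do not distinguish categories.
        key = name.strip().lower()
        if key not in self.code_of:
            self.code_of[key] = len(self.names)
            self.names.append(name)
        return self.code_of[key]

    def decode(self, code):
        return self.names[code]


def remap_codes(source, target):
    """Build a table translating codes from `source` into codes in `target`.

    Categories that `target` has not seen yet are added to it, so the
    combined data set ends up with one aligned set of codes.
    """
    return [target.encode(name) for name in source.names]
```

When combining two data sets, each stored code `c` in the second set's column would be replaced by `remap[c]`, so only one integer per value is kept in memory.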
Missing Data Point Values:
The change to the SOM algorithm is rather simple. You just compute
Euclidean distance using the dimensions the data point has. In the
Iterative SOM, you only update the node's weights in those dimensions.
In the Batch SOM, when you take the average in each dimension, you use
whatever values are present.
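
Roughly, the three changes could look like this (a Python sketch, not IFCSoft code). It assumes missing values are stored as NaN, which is only one of the options; the function names are made up, and the batch step is simplified to a plain mean over the points assigned to a node, without neighborhood weighting:

```python
import math

def partial_distance(point, node):
    """Euclidean distance over only the dimensions the data point has."""
    return math.sqrt(sum((p - n) ** 2
                         for p, n in zip(point, node)
                         if not math.isnan(p)))

def iterative_update(node, point, rate):
    """Move the node toward the point; dimensions the point lacks stay unchanged."""
    return [n if math.isnan(p) else n + rate * (p - n)
            for p, n in zip(point, node)]

def batch_mean(points, old_node):
    """Per-dimension mean of the values that are present.

    If no point has a value in some dimension, the node keeps its old value.
    """
    new_node = []
    for d, old in enumerate(old_node):
        present = [p[d] for p in points if not math.isnan(p[d])]
        new_node.append(sum(present) / len(present) if present else old)
    return new_node
```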
I don't know the best way to handle missing values in
the data sets. How do we mark a value as missing? Do we just
store it as NaN? Is there another case where you would want a
distinct NaN that doesn't mean "value missing"?
We can work on any of these pieces next. Any suggestions for how best
to do this?
Kyle