Categorical Data and Missing Data

kthayer

May 24, 2011, 5:02:53 PM

to IFCSoft

In order to make IFCSoft more robust and useful with other types of
data, it should be able to load and handle categorical data and
data points that are missing values:

Missing Data: Issue 9 (http://code.google.com/p/ifcsoft/issues/detail?id=9)
Categorical Data: Issue 10 (http://code.google.com/p/ifcsoft/issues/detail?id=10)

I want to discuss these together since they both will be changing the
same code sections (data sets and SOM algorithm) and perhaps should be
done in conjunction, or at least with a knowledge of how the other
problem will be solved.

Categorical Data:

Getting numerical categorical data to work should be relatively
straightforward. I suggest adding to the "Choose SOM Weights" dialog
box a way of setting a channel to be handled categorically. The
default weight for categorical variables should be lower (maybe 0.25),
or they will dominate the continuous variables. The list of categorical
channels would have to be passed to the SOM algorithm somehow. In the
SOM algorithm, when computing Euclidean distance, if a dimension is
categorical and the two values differ, that dimension contributes a
"distance" equal to its weight.
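
The categorical distance rule can be sketched roughly as follows. This is only an illustration, not IFCSoft code: the function name `mixed_distance`, the `weights` list, and the `categorical` index set are hypothetical, and multiplying each continuous difference by its weight is an assumption about how the channel weights would enter the distance.

```python
import math

def mixed_distance(a, b, weights, categorical):
    """Weighted Euclidean distance with some dimensions treated as categories.

    a, b        -- equal-length sequences of numbers
    weights     -- per-dimension weights (assumed to scale each difference)
    categorical -- set of dimension indices handled categorically
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in categorical:
            # Any mismatch counts as a fixed "distance" equal to the weight.
            diff = weights[i] if x != y else 0.0
        else:
            diff = weights[i] * (x - y)
        total += diff * diff
    return math.sqrt(total)

# Dimension 0 is continuous (weight 1.0), dimension 1 is a category (weight 0.25).
same_category = mixed_distance([0.0, 1], [3.0, 1], [1.0, 0.25], {1})
other_category = mixed_distance([0.0, 1], [3.0, 2], [1.0, 0.25], {1})
```

With the lower categorical weight, a category mismatch nudges the distance up slightly instead of overwhelming the continuous difference.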

Loading non-numerical categorical data will be harder. In order to not
use too much space for large data sets, we'll probably have to make
numbers to represent the different strings (or we could just use a
single String object for each unique string). We'll also have to have
a way of marking which dimensions are categorical and whether they are
string categories. Combining data sets will be complicated, since we
have to make sure all the categories align. We may want an interface
like the one for column synchronizing, for cases where category names
differ by capitalization or abbreviation.
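
One way to store string categories as numbers is a per-column lookup table. The sketch below is only an illustration: the class name `CategoryEncoder` and its methods are hypothetical, and treating capitalization and surrounding whitespace as insignificant is an assumption. Abbreviations would still need something like the manual synchronizing interface mentioned above.

```python
class CategoryEncoder:
    """Maps category strings to small integer codes (one encoder per column)."""

    def __init__(self):
        self._codes = {}   # normalized string -> integer code
        self.labels = []   # integer code -> first spelling seen

    def encode(self, text):
        # Assumption: case and surrounding whitespace are not significant.
        key = text.strip().lower()
        if key not in self._codes:
            self._codes[key] = len(self.labels)
            self.labels.append(text.strip())
        return self._codes[key]

    def decode(self, code):
        return self.labels[code]

encoder = CategoryEncoder()
codes = [encoder.encode(s) for s in ["Control", "treated", "control ", "Treated"]]
```

Sharing one encoder between two data sets while loading them would keep the codes aligned when the sets are combined.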

Missing Data Point Values:

The change to the SOM algorithm is rather simple. You just compute
Euclidean distance using the dimensions the data point has. In the
Iterative SOM, you update only the dimensions for which the data point
has a value. In the Batch SOM, when you take the average in each
dimension, you use whatever values are present.
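
A minimal sketch of these three changes, assuming missing values are stored as NaN (which is still an open question here). The function names are hypothetical, and the batch step is simplified to a plain average of the points mapped to one node; a real Batch SOM would also weight points by the neighborhood function. Keeping the old node value when no point has a dimension is likewise an assumption.

```python
import math

def distance_ignoring_missing(point, node):
    """Euclidean distance over the dimensions where `point` has a value."""
    total = 0.0
    for x, w in zip(point, node):
        if not math.isnan(x):
            total += (x - w) ** 2
    return math.sqrt(total)

def iterative_update(point, node, rate):
    """Move `node` toward `point`; dimensions missing in `point` stay unchanged."""
    return [w if math.isnan(x) else w + rate * (x - w)
            for x, w in zip(point, node)]

def batch_average(points, old_node):
    """Per-dimension mean of `points`, skipping NaNs.

    Keeps the old node value when no point has that dimension (an assumption).
    """
    new_node = []
    for d, old in enumerate(old_node):
        values = [p[d] for p in points if not math.isnan(p[d])]
        new_node.append(sum(values) / len(values) if values else old)
    return new_node
```

Because a given point is compared against every node using the same set of dimensions, the search for its closest node stays consistent even though points with fewer values produce smaller distances overall.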

I don't know the best way to handle missing data point values in
the data sets. How do we mark a value as missing? Do we just
save it as a NaN? Is there another case where you would want a
distinct NaN that doesn't mean "value missing"?

We can work on any of these pieces next. Any suggestions for how best
to do this?

Kyle

Tony

Jun 5, 2011, 10:00:23 PM

to IFCSoft

On Missing Data:
I think using NaN should be fine--there aren't any obvious cases where
the algorithm should attempt to divide by zero or spit out NaN
otherwise, right? Let's implement it and find out.

Kyle Thayer

Jun 6, 2011, 3:53:23 PM

to ifc...@googlegroups.com

Let's work on it on a separate clone so that we can make commits without breaking the main one. Do you want to add me as a committer to your clone or should I make a new clone for this?

As a first step, let's load NaNs into their appropriate places and get the SOM to work without linear initialization (that will be harder). Here is what I can think of that will need to change:

* The CSV loader needs to store NaN wherever a value is missing in the file
* Data set means/maxes/mins/std devs need to still work
* Random initialization of the SOM needs to make sure it doesn't give a node a NaN value
* In the SOM algorithm, Euclidean distance should ignore the dimensions where the data point has a NaN
* When updating a node, it must not change the dimensions where the given data point has a NaN
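
The loader, statistics, and random-initialization items in the list above could look roughly like the sketch below. It is not IFCSoft code: the function names are hypothetical, the loader assumes a header row followed by numeric rows of equal length, and drawing each node value uniformly between the column's observed min and max is an assumption about how random initialization works.

```python
import csv
import io
import math
import random

def load_csv(text):
    """Parse CSV text; empty or non-numeric cells become NaN."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, data = rows[0], []
    for row in rows[1:]:
        parsed = []
        for cell in row:
            try:
                parsed.append(float(cell))
            except ValueError:
                parsed.append(math.nan)
        data.append(parsed)
    return header, data

def column_stats(data, d):
    """Min, max, mean, and population std dev of column d, skipping NaNs."""
    values = [row[d] for row in data if not math.isnan(row[d])]
    n = len(values)
    if n == 0:
        return math.nan, math.nan, math.nan, math.nan
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return min(values), max(values), mean, std

def random_node(data, rng=random):
    """Random node with each value drawn between the column's min and max."""
    node = []
    for d in range(len(data[0])):
        lo, hi, _, _ = column_stats(data, d)
        node.append(rng.uniform(lo, hi))
    return node

header, data = load_csv("a,b\n1,2\n3,\n5,x\n")
stats_b = column_stats(data, 1)
node = random_node(data)
```

A column with no values at all would still yield a NaN node value in this sketch, so that case would need separate handling.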

Later fixes:

* Union data calculates min/max/mean from the base sets, but the weighted averaging won't work if not all points have a value in every dimension. We will need to track how many points had a value in each dimension.
* Finding a good solution for linear init will be fairly difficult.
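
For the union data item, a per-dimension count makes the weighted average work again. A minimal sketch, assuming each base set can report, for every dimension, its mean and the number of points that actually had a value (the function name `combined_mean` and both arguments are hypothetical):

```python
def combined_mean(means, counts):
    """Mean of one dimension over the union of several data sets.

    means[i]  -- mean of this dimension in data set i (ignored if counts[i] == 0)
    counts[i] -- number of points in data set i that have this dimension
    """
    total = sum(counts)
    if total == 0:
        return float("nan")
    # Skip empty sets so that a NaN mean cannot poison the sum.
    return sum(m * c for m, c in zip(means, counts) if c > 0) / total

# Set A has 1 point with this dimension (mean 2.0), set B has 3 (mean 4.0).
union_mean = combined_mean([2.0, 4.0], [1, 3])
```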

Kyle

Tony

Jun 12, 2011, 11:59:49 PM

to IFCSoft

I'll make a clone and add you (Kyle) as a committer. Do you want to try to tackle
these problems together (since both will require an overhaul) or
separately (since they're both fairly complex)?

Tony

Kyle Thayer

Jun 15, 2011, 2:03:16 AM

to ifc...@googlegroups.com

That sounds good. I'd say let's only do one at a time, but keep the other in mind so we don't paint ourselves into a corner. Let me know when it is up and we can start trying to make it happen.
Kyle

Kyle Thayer

Jul 13, 2011, 7:06:14 PM

to ifc...@googlegroups.com

I've started implementing the missing data handling, but I cannot commit to your "aaboyles-breakable" clone. Can you add me to that?

Kyle

Kyle Thayer

Jul 14, 2011, 2:49:21 PM

to ifc...@googlegroups.com

Well, it turns out that Google Code only allows one committer per clone (http://groups.google.com/group/google-code-hosting/browse_thread/thread/42a221dfa3f83edd/c8d94faf27925731?lnk=gst&q=clone#c8d94faf27925731). So I made my own clone here: http://code.google.com/r/kylethayer-breakable/. You can pull the changes I've made from there and make changes within your clone if you want. I've only got things partially working and haven't yet verified that it gives the right values for means and standard deviations.

Kyle

Tony Boyles

Jul 14, 2011, 5:25:42 PM

to Kyle Thayer, ifc...@googlegroups.com

Cool! I'll pull it down and take a look.

Thanks,

Tony

Kyle Thayer

Aug 9, 2011, 5:43:16 PM

to ifc...@googlegroups.com

Version 0.4.1, which now handles missing data, has been uploaded. It can be run from the main website.

Kyle
