Adaboost Questions


Joseph Burley

Mar 4, 2013, 12:01:48 AM
to cs6...@googlegroups.com
The datasets seem to have quite a few unknown values ('?') scattered across different features and datapoints. I started by ignoring these datapoints when splitting on a feature, but I was wondering whether it makes sense to do either of the following:

 1. When the feature is a continuous feature (real values), "fixup" the dataset by replacing '?' with the mean of the other values for that feature.
 2. When the feature is a discrete feature (specific values), "fixup" the dataset by using '?' as a distinct value.

If we shouldn't "fixup" the datasets, how should we treat these datapoints when updating our distribution? Should we leave these points' weights the same (instead of increasing or decreasing)?
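
For reference, the two fix-ups listed above could be sketched roughly as follows. This is only an illustration, not the course's required approach: the function name `fixup_column`, the string-valued column format, and the 0.0 fallback for an all-missing column are assumptions made for the example.

```python
from statistics import mean

MISSING = "?"

def fixup_column(values, continuous):
    """Return a copy of one feature column with '?' handled.

    continuous=True:  replace '?' with the mean of the observed values.
    continuous=False: keep '?' so it acts as its own discrete value.
    """
    if not continuous:
        return list(values)  # '?' stays as a distinct category
    observed = [float(v) for v in values if v != MISSING]
    # Assumption: fall back to 0.0 if every value in the column is missing.
    fill = mean(observed) if observed else 0.0
    return [fill if v == MISSING else float(v) for v in values]
```

With this, `fixup_column(["1.0", "?", "3.0"], continuous=True)` fills the gap with the mean of 1.0 and 3.0, while a discrete column is returned unchanged so a split can treat '?' like any other value.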

cs15...@gmail.com

Mar 12, 2013, 1:31:10 PM
to cs6...@googlegroups.com
Both 1 and 2 are good ideas. You can also try:
- ignore training records with missing values
- predict the missing values (e.g. with a simple Naive Bayes)
- when a decision stump hits a missing value on its feature, have it predict the majority label for that stump
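
A rough sketch of the last option, under some assumptions: labels are +1/-1, the stump is a simple threshold on one continuous feature, and "majority label" is read as the weighted majority under the current AdaBoost distribution. The names `weighted_majority` and `make_stump` and the toy data are made up for illustration.

```python
MISSING = "?"

def weighted_majority(labels, weights):
    """Label (+1 or -1) carrying the larger total weight; ties go to +1."""
    total = sum(w * y for y, w in zip(labels, weights))
    return 1 if total >= 0 else -1

def make_stump(threshold, default_label):
    """Threshold stump on one continuous feature.

    Predicts +1 when value > threshold, -1 otherwise,
    and default_label when the value is '?'.
    """
    def predict(value):
        if value == MISSING:
            return default_label
        return 1 if float(value) > threshold else -1
    return predict

# Toy column, labels and uniform weights (illustrative only).
column = ["0.5", "?", "2.0", "3.0"]
labels = [-1, 1, 1, 1]
weights = [0.25, 0.25, 0.25, 0.25]

stump = make_stump(threshold=1.0,
                   default_label=weighted_majority(labels, weights))
predictions = [stump(v) for v in column]
```

Since every datapoint then receives a prediction, including those with '?', the usual weight update can be applied to all of them without a special case.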

In any case, make sure to clearly state what you have done in the report.

--virgil