I think it is still useful to have a single number (or a set of
numbers measuring different aspects of quality) so that we can rank
programs in a table. Right now, I'm using average percentile, which
is invariant to normalizing by the minimum error.
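
For concreteness, here is a rough sketch of how an average percentile
could be computed. The exact definition used on the site is not spelled
out in this thread, so the 0-100 scale, the tie handling, and the names
percentiles / average_percentile are all assumptions made for
illustration.

```python
def percentiles(errors):
    """Map program -> percentile in [0, 100] on one dataset.

    errors: dict of program name -> error (lower is better).
    Assumed definition: share of the *other* programs whose error is
    at least as large as this program's error.
    """
    n = len(errors)
    out = {}
    for name, err in errors.items():
        beaten_or_tied = sum(
            1 for other, e in errors.items() if other != name and e >= err
        )
        # With a single program there is nothing to compare against.
        out[name] = 100.0 * beaten_or_tied / (n - 1) if n > 1 else 100.0
    return out


def average_percentile(runs):
    """runs: dict of dataset -> {program -> error}.

    Returns program -> mean percentile over the datasets it ran on.
    """
    collected = {}
    for errors in runs.values():
        for name, p in percentiles(errors).items():
            collected.setdefault(name, []).append(p)
    return {name: sum(ps) / len(ps) for name, ps in collected.items()}
```

Because only the ordering of errors within a dataset matters, dividing
every error on a dataset by the minimum error (or any positive
constant) leaves the result unchanged.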
The other challenge is rating datasets. Right now, I'm using standard
deviation of error, but that doesn't seem like a good metric (in
particular, some regression programs have astronomically high ratings,
which seems bad). Anyone have a better idea?
-Percy
I put a tab-separated file here:
http://mlcomp.org/public/download/runs.txt
which will be auto-generated as we have more runs.
The first row describes the columns.
I think there's quite a bit of interesting meta-data-analysis to be done here.
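
A minimal sketch for loading the file, assuming only what is stated
above (tab-separated, first row holds the column names). The function
name parse_runs is made up for this example, and no particular column
names are assumed.

```python
import csv
import io


def parse_runs(text):
    """Parse tab-separated text whose first row names the columns.

    Returns a list of dicts, one per run, keyed by column name.
    All values are left as strings; convert them as needed.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Possible usage (requires network access):
#   from urllib.request import urlopen
#   url = "http://mlcomp.org/public/download/runs.txt"
#   rows = parse_runs(urlopen(url).read().decode("utf-8"))
#   print(list(rows[0].keys()))  # inspect the actual column names
```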
> Regarding the rating of datasets, standard deviation of error isn't
> too bad. This metric gives some indication of whether the dataset is
> specialized or generalized. If you are having numerical problems, I
> would suggest
> 1. using the IQR instead of the standard deviation,
> 2. only quoting the metric if a large number of algorithms have been
> applied to the data set
Those are good suggestions.
1. The problem is with regression and collaborative filtering. The
current error metrics are not scale-invariant. In some
regression problems the outputs are huge, and in others they're small.
I think we should probably be reporting some normalized error metric
(divide by total variance of Y or something) rather than plain MSE.
Question: what do people care about in regression? With a normalized
metric, I think either IQR or standard deviation would be reasonable.
2. I just changed it so a rating is only quoted for a dataset if it
has more than 5 programs run on it.
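
The two points above could be sketched roughly as follows. This is not
the site's actual code: the function names, the choice of population
variance for normalizing, and the quartile method (the default of
Python's statistics.quantiles) are assumptions.

```python
import statistics


def normalized_mse(y_true, y_pred):
    """MSE divided by the variance of the targets Y.

    Rescaling Y and the predictions by the same constant leaves this
    value unchanged, unlike plain MSE.
    """
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var_y = statistics.pvariance(y_true)
    if var_y == 0:
        raise ValueError("targets are constant; normalized MSE is undefined")
    return mse / var_y


def iqr_rating(errors, min_programs=6):
    """IQR of program errors on one dataset.

    Returns None unless more than 5 programs have been run,
    i.e. at least min_programs errors are available.
    """
    if len(errors) < min_programs:
        return None
    q1, _median, q3 = statistics.quantiles(errors, n=4)
    return q3 - q1
```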
Cheers,
-Percy
Perhaps one should only take the top 50% of programs on a dataset.
Intuition: a dataset is good if "reasonable" attempts at solving the
problem have large variance. If some lame program gets huge error, we
ought to not count that. Does this seem reasonable?
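
A rough sketch of that idea, assuming "top 50%" means the half of the
programs with the lowest error (rounding up) and that spread is
measured by standard deviation. The name top_half_spread is invented
for this example.

```python
import statistics


def top_half_spread(errors):
    """Standard deviation of error over the best half of programs.

    errors: list of errors on one dataset (lower is better).
    Returns None if fewer than two programs would remain.
    """
    ranked = sorted(errors)
    # Keep the lower-error half, rounding up, but at least two values
    # so that a spread can be computed.
    keep = ranked[: max(2, (len(ranked) + 1) // 2)]
    if len(keep) < 2:
        return None
    return statistics.pstdev(keep)
```

With errors like [0.1, 0.2, 0.3, 0.4, 1000.0], the one program with a
huge error dominates the plain standard deviation but is dropped here.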
Taking the log seems like a very good idea. This renders
normalization by min or median and all the worries of sensitivity to
scales in regression problems obsolete. Let E be a random variable
denoting the error of a randomly chosen program on a dataset. Then
var(\log E) = var(\log (c E)) for any scaling c > 0. This seems like a
nice property.
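
The invariance follows in one step, assuming c > 0 and E > 0 so that
both logs are defined: the log turns the scaling into an additive
constant, and adding a constant does not change the variance.

```latex
\begin{align*}
\operatorname{var}(\log(cE))
  &= \operatorname{var}(\log c + \log E) \\
  &= \operatorname{var}(\log E), \qquad c > 0,\; E > 0.
\end{align*}
```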
I just changed the rating code and the site now reflects the new
log-based rating system for datasets. (One has to worry a bit about
corner cases when E = 0 is possible. I just set a floor of 1e-4 for
now.) The numbers at least seem more reasonable...
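
A minimal sketch of what such a log-based rating might look like. The
1e-4 floor is the one mentioned above; using the population variance of
the log errors (rather than, say, their standard deviation) and the
name log_rating are assumptions, not the site's actual code.

```python
import math
import statistics

ERROR_FLOOR = 1e-4  # floor from the thread, guards against log(0)


def log_rating(errors, floor=ERROR_FLOOR):
    """Variance of log-error across programs on one dataset.

    Errors below the floor (including E = 0) are clamped to the floor
    before taking the log.
    """
    logs = [math.log(max(e, floor)) for e in errors]
    return statistics.pvariance(logs)
```

Away from the floor, rescaling all errors by a positive constant leaves
the rating unchanged; once some errors hit the floor, that invariance
no longer holds exactly.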
-Percy