quality metric for programs

shabbychef

unread,

Oct 25, 2010, 1:41:23 PM10/25/10

to MLcomp

(google groups will not let me 'reply' to the original message. how
annoying)

I would suggest using the 'benchmark plot' approach of Dolan and
More. These plots were designed to compare the run-times (wall-clock
and iterations) of various optimization algorithms on some standard
problems. MLcomp can use them to compare error rates. (In both cases
the metric in question is a non-negative quantity that one would want
to minimize).

To be precise, for each data set i and each algorithm j, let E_i,j be
the error rate from applying algorithm j to data set i. Let E_i,j be
infinity if the given algorithm was not applied to the given data
set. Then define M_i = min_j E_i,j . This is the minimum error for
the dataset over all algorithms tried. Then define F_i,j = E_i,j /
M_i . This ratio is the factor by which the error of algorithm j was
worse than the optimal algorithm when applied to dataset i. To make
the benchmark plot, for each algorithm, plot the empirical CDF of the
factors F_i,j (that is, over all i). To be fair, you probably want to
not count those F_i,j which are infinite (i.e. for which the algorithm
was not applied to the data set).

see, for example, http://www.gamsworld.org/performance/pprofile.htm or
http://www.gamsworld.org/performance/profileGnuplot.gif

note that a benchmark plot is somewhat more nuanced than having a
single 'quality' metric associated with an algorithm. Imagine, for
example, an algorithm which is a good generalist: it works reasonably
well on all data sets. When should one use this algorithm instead of a
specialist algorithm which is very good for certain data sets, but
terrible for others? How should one 'rank' these two algorithms? A
benchmark plot would show this character, but I'm afraid that a single
number would lose the detail.

Percy Liang

unread,

Oct 26, 2010, 12:20:52 AM10/26/10

to mlc...@googlegroups.com

Thanks for the suggestion. The point about the nuances of the plot is
a good one. We could have one of these CDF plots on each page for a
program. Anyone know a good package for making such plots that embeds
nicely on websites?

I think it is still useful to have a single number (or a set of
numbers measuring different aspects of quality) so that we can rank
programs in a table. Right now, I'm using average percentile, which
is invariant to normalizing by the minimum error.

The other challenge is rating datasets. Right now, I'm using standard
deviation of error, but that doesn't seem like a good metric (in
particular, some regression programs have astronomically high ratings,
which seems bad). Anyone have a better idea?

-Percy

shabbychef

unread,

Oct 26, 2010, 1:01:21 PM10/26/10

to MLcomp

If I could get all the data on all the runs in a single CSV file, I
could generate a mockup. The data ideally would be in the form of 4-
tuples: (domain, dataset, algo, error). the code that I have has some
copyright issues, so this wouldn't be a good long term solution. I
would guess this can easily be done in R very well.

Regarding the rating of datasets, standard deviation of error isn't
too bad. This metric gives some indication of whether the dataset is
specialized or generalized. If you are having numerical problems, I
would suggest
1. using the IQR instead of the standard deviation,
2. only quoting the metric if a large number of algorithms have been
applied to the data set

Percy Liang

unread,

Oct 28, 2010, 3:23:16 AM10/28/10

to mlc...@googlegroups.com

> If I could get all the data on all the runs in a single CSV file, I
> could generate a mockup. The data ideally would be in the form of 4-
> tuples: (domain, dataset, algo, error). the code that I have has some
> copyright issues, so this wouldn't be a good long term solution. I
> would guess this can easily be done in R very well.

I put a tab-separated file here:
http://mlcomp.org/public/download/runs.txt
which will be auto-generated as we have more runs.
The first row describes the columns.

I think there's quite a bit of interesting meta-data-analysis to be done here.

> Regarding the rating of datasets, standard deviation of error isn't
> too bad. This metric gives some indication of whether the dataset is
> specialized or generalized. If you are having numerical problems, I
> would suggest
> 1. using the IQR instead of the standard deviation,
> 2. only quoting the metric if a large number of algorithms have been
> applied to the data set

Those are good suggestions.
1. The problem is with regression and Collaborative filtering. The
current error metrics used are not scale sensitive. In some
regression problems, the outputs are huge and others, they're small.
I think we should probably be reporting some normalized error metric
(divide by total variance of Y or something) rather than plain MSE.
Question: what do people care about in regression? Then, perhaps IQR
or standard deviation I think would be reasonable.

2. I just changed it so a rating is only quoted for a dataset if it
has more than 5 programs run on it.

Cheers,

-Percy

shabbychef

unread,

Oct 29, 2010, 2:10:10 PM10/29/10

to MLcomp

I cannot seem to download that file: chromium thinks mlcomp.org/public
does not exist. Perhaps the permissions are off? I am trying via ftp
as well, but no luck.

Regarding the scale of error metrics on data sets, perhaps one could
use the same trick that benchmark plots do: divide the error
associated with a (dataset,algo) 2-tuple by the minimum over all algos
on that dataset. Another approach would be to divide e.g. by the
median over all algos on that dataset, then take the log to get a zero-
centered metric. Once you do this, the standard deviation should be
more well behaved.

Percy Liang

unread,

Oct 29, 2010, 2:15:51 PM10/29/10

to mlc...@googlegroups.com

Sorry, the link is:
http://mlcomp.org/download/runs.txt

-Percy

Percy Liang

unread,

Oct 29, 2010, 2:35:15 PM10/29/10

to mlc...@googlegroups.com

The problem with dividing by the minimum error and computing standard
deviation is that for some datasets, the minimum error is 1e-4 and
some errors are 1000. Divide and you get a huge number, which doesn't
match my intuition of necessarily being a "good" dataset.

Perhaps one should only take the top 50% of programs on a dataset.
Intuition: a dataset is good if "reasonable" attempts at solving the
problem have large variable. If some lame program gets huge error, we
ought to not count that. Does this seem reasonable?

Taking the log seems like a very good idea. This renders
normalization by min or median and all the worries of sensitivity to
scales in regression problems obsolete. Let E be a random variable
denoting the error of a randomly chosen program on a dataset. Then
var(\log E) = var(\log (c E)) for any scaling c. This seems like a
nice property.

I just changed the rating code and the site now reflects the new
log-based rating system for datasets. (One has to worry a bit about
corner cases when E = 0 is possible. I just set a floor of 1e-4 for
now.) The numbers at least seem more reasonable...

-Percy

shabbychef

unread,

Oct 29, 2010, 3:20:03 PM10/29/10

to MLcomp

I mocked up some benchmark plots for the data from the runs.txt file.
I first cut out all data sets which ran on fewer than 5 algorithms,
then I cut out all algorithms which ran on fewer than 15 of the
(remaining) datasets. This gives algorithms in only the 3 domains:
BinaryClassification, MulticlassClassification and Regression. I could
not upload the files to this group, so here they are in my docs:

https://docs.google.com/leaf?id=0B_fVxpoO16MMNGUyYzFkZDgtNTE0YS00ODdlLTgxYmQtMDViNTcxYmRlNjEy&hl=en&authkey=CO7hmr0J

Matlab does a terrible job of exporting plots, unfortunately. R would
be much better about this. To read benchmark plots, you want something
which is 'up and to the left', like a ROC curve, sort of. The X axis
is the ratio of error to the minimum error for that dataset across all
algos. Each line is an algo. The empirical CDFs are shown. If one was
viewing a point (2.3,0.85) on a line, one could say, then, that "for
the algorithm associated with this line, in 85% of the datasets
tested, the error was within 2.3 times the minimum error achieved for
that dataset by _any_ algorithm." I clip the x limits to 7 on the
left, otherwise there is a long tail on the right ;). The appearance
of not reaching the top of the graph indicates the algorithm can have
very poor performance on some datasets. A good 'generalist' algo would
have a line terminating at, say, (2.0,1.0), i.e. in 100% of the cases,
it did no worse than twice as bad as the best algorithm. A
'specialist' algorithm would be straight up to some point, say
(1.0,y), then 'hockey stick' right hard and perhaps end off the graph.

hth,

On Oct 29, 11:35 am, Percy Liang <pli...@cs.berkeley.edu> wrote:
> The problem with dividing by the minimum error and computing standard
> deviation is that for some datasets, the minimum error is 1e-4 and
> some errors are 1000. Divide and you get a huge number, which doesn't
> match my intuition of necessarily being a "good" dataset.
>
> Perhaps one should only take the top 50% of programs on a dataset.
> Intuition: a dataset is good if "reasonable" attempts at solving the
> problem have large variable. If some lame program gets huge error, we
> ought to not count that. Does this seem reasonable?
>
> Taking the log seems like a very good idea. This renders
> normalization by min or median and all the worries of sensitivity to
> scales in regression problems obsolete. Let E be a random variable
> denoting the error of a randomly chosen program on a dataset. Then
> var(\log E) = var(\log (c E)) for any scaling c. This seems like a
> nice property.
>
> I just changed the rating code and the site now reflects the new
> log-based rating system for datasets. (One has to worry a bit about
> corner cases when E = 0 is possible. I just set a floor of 1e-4 for
> now.) The numbers at least seem more reasonable...
>
> -Percy
>
>
>
>
>
>
>
> On Fri, Oct 29, 2010 at 11:15 AM, Percy Liang <pli...@cs.berkeley.edu> wrote:
> > Sorry, the link is:
> >http://mlcomp.org/download/runs.txt
>
> > -Percy
>

Percy Liang

unread,

Oct 29, 2010, 9:55:10 PM10/29/10

to mlc...@googlegroups.com

Thanks for doing the mockup - it's interesting! So most of the
algorithms are 2-competitive on 80% of the datasets, but there is this
long tail, where each algorithm has some dataset on which it works
pretty badly (presumably, different datasets). I would be really
interested to see how well an ensemble on these 9 algorithms would
perform...

-Percy

shabbychef

unread,

Oct 30, 2010, 1:31:44 AM10/30/10

to MLcomp

The graph for BinaryClassification is more clear, I think. The link to
google docs was kinda screwy. this should be it:
https://docs.google.com/leaf?id=0B_fVxpoO16MMYTMwZDAwZDMtY2IzYS00Mzk0LTg5NTQtMjEyNzRjYWI1YzI4&sort=name&layout=list&num=50
In this case, minimalist-boost is the clear 'winner', and svmlight-rbf
is looking like a turkey.

On Oct 29, 6:55 pm, Percy Liang <pli...@cs.berkeley.edu> wrote:
> Thanks for doing the mockup - it's interesting! So most of the
> algorithms are 2-competitive on 80% of the datasets, but there is this
> long tail, where each algorithm has some dataset on which it works
> pretty badly (presumably, different datasets). I would be really
> interested to see how well an ensemble on these 9 algorithms would
> perform...
>
> -Percy
>
>
>
> On Fri, Oct 29, 2010 at 12:20 PM, shabbychef <shabbyc...@gmail.com> wrote:
>
> > I mocked up some benchmark plots for the data from the runs.txt file.
> > I first cut out all data sets which ran on fewer than 5 algorithms,
> > then I cut out all algorithms which ran on fewer than 15 of the
> > (remaining) datasets. This gives algorithms in only the 3 domains:
> > BinaryClassification, MulticlassClassification and Regression. I could
> > not upload the files to this group, so here they are in my docs:
>

> >https://docs.google.com/leaf?id=0B_fVxpoO16MMNGUyYzFkZDgtNTE0YS00ODdl...

Percy Liang

unread,

Oct 30, 2010, 1:34:19 AM10/30/10

to mlc...@googlegroups.com

I get this when I clicked on the link:
Sorry, the page (or document) you have requested is not available.

shabbychef

unread,

Oct 30, 2010, 12:30:23 PM10/30/10

to MLcomp

the first link should have linked to a 'folder' with 3 plots in it,
but google is getting funny on me. try this for the
BinaryClassification graph:

https://docs.google.com/leaf?id=0B_fVxpoO16MMYTMwZDAwZDMtY2IzYS00Mzk0LTg5NTQtMjEyNzRjYWI1YzI4&sort=name&layout=list&num=50

On Oct 29, 10:34 pm, Percy Liang <pli...@cs.berkeley.edu> wrote:
> I get this when I clicked on the link:
> Sorry, the page (or document) you have requested is not available.
>
>
>
> On Fri, Oct 29, 2010 at 10:31 PM, shabbychef <shabbyc...@gmail.com> wrote:
> > The graph for BinaryClassification is more clear, I think. The link to
> > google docs was kinda screwy. this should be it:

> >https://docs.google.com/leaf?id=0B_fVxpoO16MMYTMwZDAwZDMtY2IzYS00Mzk0...

> ...
>
> read more »

Jacob Abernethy

unread,

Oct 30, 2010, 11:47:46 PM10/30/10

to mlc...@googlegroups.com

Nope, still no good.

Reply all

Reply to author

Forward