I am doing some data analysis for my Ph.D. and have a fairly
straightforward statistical methods question.
I have five algorithms that I'm comparing, using three different
performance metrics. There are six datasets involved in the
comparison. You can visualize this as a table that is five rows in
height (each algorithm) and 18 rows in width (6 datasets, three
scoring metrics each).
All of my data are scale/continuous data, and are scores on the
interval [0, 1].
Let's consider just one dataset. I have a score (this is a mean socre,
so it also has std. dev) for each of the five algorithms, on three
scoring metrics (15 values total).
One experimental question I have is, how well to the metrics
correlate? Do they measure the same thing or different things? One way
to look at this question is to rank the five algorithms based on
score, so now I have three rank vectors, one for each metric.
Are you familiar with any methods that can give a well founded notion
of how well these ranks correlate?
The same question applies to comparing the five algorithms using just
ONE metric over SIX datasets; the formulation is simiar and/or
identical. The final research question would then draw conclusions
from the ranking from all six datasets using all three performance
metrics.
Thanks,
--Mike
Ph.D. Candidate, December 2007
Each dataset consists of 100 "frames" against which the algorithm's
output is scored. So Algorithm A's performance on Dataset 1 is the
mean of its performance (Accoding to Metric M1) on each of the 100
frames in that dataset; the standard deviation is also reported from
this. The same is reported for Metrics M2 and M3, and then so on for
each of the six datasets.
So does your basic data look like this, where each "x" is a score,
and values in the same row are matched in the sense that they all
came from the same frame in the same data set, but that the frames
are not matched from one data set to another (i.e., in analysis-of-
variance terminology, frames are nested within data sets, and
crossed with algorithms and metrics)?
Data Alg. A Alg. B Alg. C Alg. D Alg. E
Set Frame M1 M2 M3 M1 M2 M3 M1 M2 M3 M1 M2 M3 M1 M2 M3
1 1 x x x x x x x x x x x x x x x
1 2 x x x x x x x x x x x x x x x
1 : : : : : : : : : : : : : : : :
1 100 x x x x x x x x x x x x x x x
: : : : : : : : : : : : : : : : :
6 1 x x x x x x x x x x x x x x x
6 2 x x x x x x x x x x x x x x x
6 : : : : : : : : : : : : : : : :
6 100 x x x x x x x x x x x x x x x
Thank you SO much for your assistance--I am grateful.
You have the general idea correct! Except in my rendering, ALG and
DataSet are transposed (Algs A-E are rows, Datasets are columns).
Further, I have condensed down the frames into a single value: the
mean and std. dev, so the nested rendering that you have provided is
thus compressed.
Dataset DS1 DS2 ... DS6
Metric M1 M2 M3 M1 M2 M3 ... M1 M2 M3
Alg
Alg. A x x x x x x x x x
Alg. B x x x x x x x x x
Alg. C x x x x x x x x x
Alg. D x x x x x x x x x
Alg. E x x x x x x x x x
The end goal is to come up with a final composite ranking of the
algorithms considering all of the data. This ranking of course answers
the key question, "Which algorithm is best?"
I have formulated my two research questions as follows:
1. Is the ranking of the algorithms dataset-dependent? (That is, given
some metric, does one algorithm perform best on all datasets, or does
the ranking of the algorithsm differ depending on the dataset?)
2. Is the ranking of the algorithms metric-dependent? (That is, given
a particular dataset, does one algorithm always perform best
regardless of the metric, or does the ranking of the algorithms differ
depending on the metric?)
Towards answering these questions, I checked out two books from the
library today:
Book 1 (1975)
Nonparametrics: statistical methods based on ranks E. L. Lehmann, with
the special assistance of H. J. M. D'Abrera
Book 2 (1999)
Nonparametric statistical methods 2nd Edition Myles Hollander, Douglas
A. Wolfe
However, I am having trouble adapting the methods here (specifically,
the elementary nonparametric ranking methods like the Wilcoxon Rank-
Sum method, etc.) to my particular research problem.
Any help or insight you could provide would be greatly appreciated.
Thank you,
--Mike
Here's one thought. Is it principled to take the average over ALL
frames in ALL datasets:
Dataset DS_ALL_AVG
Metric M1 M2 M3
Alg
Alg. A x x x
Alg. B x x x
Alg. C x x x
Alg. D x x x
Alg. E x x x
Importantly, I am not obfuscating the result by combining metrics, but
I still get an overall answer and ranking.
So one final question is: Consider that the rankings of the algorithms
M1, M2, and M3 are SIMILAR but NOT IDENTICAL. The Spearman test may or
may not give a statistical basis to say that they're independent. Even
if they are NOT independent--how do you obtain a final ranking of the
algorithms?
The other tricky part is applying the rank-oriented tests when your
values are means with confidence intervals, and the confidence
intervals overlap.
The most basic correlation is the within-condition correlation, which
would be calculated over the 100 frames of a given dataset. You would
have six 15 x 15 matrices of these, one matrix for each dataset. If
the data is organized as in my previous post:
Algorithm ..A.. ..B.. ..C.. ..D.. ..E..
Metric 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 ,
each correlation matrix could be partitioned into 3 x 3 submatrices.
The diagonal blocks would tell how similar the results of the three
metrics are for each algorithm. (But note that this similarity refers
only to the *patterns* of the scores for the 100 frames, because
correlation ignores differences in level and variability.) The
diagonal elements of the offdiagonal blocks, or what would be the
elements of the 5 x 5 diagonal blocks, if you reorganized the data as
Metric ....1.... ....2.... ....3....
Algorithm A B C D E A B C D E A B C D E ,
tell how similar the score patterns of the different algorithms are
for each metric.
However you choose to organize them, you should inspect the six
within-condition correlation matrices to see the extent to which they
differ from one dataset to another. (Some people prefer to compare
the within-condition covariance matrices as well, or even instead.)
To get an average within-condition correlation matrix, it is
customary to pool the within-condition covariance matrices (i.e.,
to average them, weighting by their degrees of freedom), and then
convert the pooled covariance matrix to a correlation matrix in the
usual way (i.e., by dividing each offdiagonal element by the square
root of the product of the corresponding diagonal elements).
Those are within-condition correlations. There are two other types.
You get a single 15 x 15 matrix of between-condition correlations by
correlating the mean scores over the datasets (again with weighting
-- this time, weighting by the sample sizes).
You get a single 15 x 15 matrix of overall correlations by stacking
the six 100 x 15 data matrices to get a 600 x 15 matrix, and then
correlating over the 600 frames. (The algebraic relations among the
three kinds of matrices parallel the relation of the total sum of
squares to the between-condition and within-condition sums of squares
in analysis of variance.)
The between-condition and within-condition correlations can be very
different from one another; the overall correlation is somewhere in
between. One of the standard mistakes is to expect the between- and
within- correlations to be the same. There is no reason why one could
not be strongly positive and the other strongly negative. Another
standard mistake is to calculate one and interpret it as if it were
the other. They are conceptually different; they reflect different
sources of variability.
There are several ways to get correlations that address the question
"how similar are the metrics?". For each dataset, you could stack the
data for the algorithms to get a 500 x 3 matrix, and then get six
3 x 3 correlation matrices. Or for each algorithm, you could stack
the data to get a 600 x 3 matrix, and then get five 3 x 3 correlation
matrices. Or you could stack the algorithms and datasets to get a
3000 x 3 matrix, from which you get a single 3 x 3 correlation
matrix. Each matrix would have a different interpretation.
All that was with actual scores. I've run out of steam for now.
I'll try to deal with ranking later.
A very happy Thanksgiving to you!
Take care,
--Mike
I've realized that I don't see why you need to use ranks. Before I
can make suggestions on how to use them best, I need to understand
why you would be dissatisfied with doing everything in terms of the
actual scores and their means, standard deviations, correlations,
etc. What information would be missing or distorted?
Also, it might help if you said something about the whole situation
-- what the general problem is, what the algorithms are doing, what
the metrics are, how the datasets and frames were chosen, etc.