Yes, the AUROC for a binary outcome is
the same as the C-index (also called
Harrell’s C-index or the C-statistic), which is a (certain) concordance
probability.
Here’s a quick explanation:
Setting: You have a binary outcome variable and a continuous
variable,
e.g., a risk score (which you hope can predict the binary
outcome).
Then you can do either of the following:
1: AUROC
For the binary ‘successes’, look at all observed values
of the continuous variable (also including ±∞).
Treat each of these values as a threshold x, and for each threshold,
calculate the proportion of ‘successes’ whose continuous
variable is ≥ x. This is the estimated sensitivity of
a test that predicts binary ‘success’ when the corresponding
risk score is ≥ x and ‘failure’ when it is < x.
For the binary ‘failures’, calculate the proportion whose
risk score is < x, again for each threshold x.
This is the estimated specificity of the test (for the
given threshold x).
Now you have the sensitivity and
specificity for various thresholds.
Plot sensitivity on the y axis and 1 − specificity on the
x axis.
This is the ROC curve (Receiver Operating Characteristic curve).
The area under the curve (from 0 to 1), AUC, is the AUROC
(should really be called the AUROCC …). If the risk score isn’t
useful for predicting the binary variable, the curve will approx.
lie on the diagonal from the lower left to the upper right,
so the AUROC will be 0.5. If it’s useful, it will lie above this
diagonal, so the AUROC will be > 0.5.
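The construction above can be sketched in a few lines of plain Python. This is only a minimal illustration on made-up data: the function names `roc_points` and `auroc` and the toy scores are assumptions for the example, not taken from any library. The area is computed with the trapezoidal rule.

```python
def roc_points(scores, outcomes):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]  # 'successes'
    neg = [s for s, y in zip(scores, outcomes) if y == 0]  # 'failures'
    # Every observed score is a threshold; +inf gives the (0, 0) corner.
    # The smallest observed score already gives the (1, 1) corner.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for x in thresholds:
        sensitivity = sum(s >= x for s in pos) / len(pos)
        false_pos_rate = sum(s >= x for s in neg) / len(neg)  # 1 - specificity
        points.append((false_pos_rate, sensitivity))
    return points


def auroc(scores, outcomes):
    """Area under the ROC curve, using the trapezoidal rule."""
    pts = roc_points(scores, outcomes)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy data (hypothetical): four 'successes' followed by four 'failures'.
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.4, 0.3, 0.2]
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
print(auroc(scores, outcomes))  # 0.84375
```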
2: Concordance probability (C-index)
Look at all ‘success’/‘failure’ observation pairs in the data.
Calculate the proportion of concordant pairs, i.e., the pairs
where the ‘success’ observation has a higher value on the
continuous variable than the ‘failure’ observation (ignore ties
for now).
This is the concordance probability (or C-index or C-statistic).
If the risk score isn’t useful for predicting the binary variable,
this value will be approx. 0.5. If it’s useful, it will be > 0.5.
And this value happens to be exactly the same as the AUROC above!
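As a rough sketch of the pair counting, here is a brute-force version in plain Python. The function name `concordance` and the toy data are made up for illustration, and the code assumes the ‘half a concordance’ convention for tied scores rather than ignoring ties.

```python
def concordance(scores, outcomes):
    """Proportion of concordant 'success'/'failure' pairs (ties count 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]  # 'successes'
    neg = [s for s, y in zip(scores, outcomes) if y == 0]  # 'failures'
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0   # concordant pair
            elif p == n:
                total += 0.5   # tied pair: assumed to count as half
    return total / (len(pos) * len(neg))


# Toy data (hypothetical): four 'successes' followed by four 'failures'.
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.4, 0.3, 0.2]
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
print(concordance(scores, outcomes))  # 0.84375
```

For this toy data, 13.5 of the 16 pairs are concordant (one pair is tied), and the trapezoidal area under the ROC curve for the same data is also 0.84375.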
Additional details:
1. For logistic regression, i.e. multiple predictor variables, use
the
linear predictor as the risk score.
2. For Cox regression with some censored data (and for similar
survival models)
you can also use the linear predictor, but note that
not all pairs provide
useful data. For example, if death is defined as success and
you observe
time 3 for patient A and time 5 for patient B, there are four possible
censored/non-censored combinations:
A died at time 3 and B died at time 5: B lived
longer than A.
A died at time 3 and B was censored at time 5: B lived
longer than A.
A was censored at time 3 and B died at time 5: No
information on who lived the longest.
A was censored at time 3 and B was censored at time 5: No
information on who lived the longest.
For all useful pairs, calculate the proportion of concordant
pairs,
i.e., pairs where the person who lived longer had the
lower risk score (assuming a higher score means a higher risk of death).
Again, this is Harrell’s C, a measure of concordance.
(If people talk about Harrell’s C, they usually talk about
survival data.)
3. Regarding ties: If the continuous variable was really
continuous,
ties would have probability 0. In practice, there will often be
ties.
Treat them as ‘half a concordance’ when calculating the
proportion of
concordant pairs. This is equivalent to using linear
interpolation
for points corresponding to the observed thresholds in
the ROC curve
(i.e., using the trapezoidal rule for calculating the area
under the curve).
4. The Wilcoxon–Mann–Whitney test is also equivalent! This is a
test of
the null hypothesis that the concordance probability is exactly
0.5,
i.e., that for a random ‘success’/‘failure’ pair with
corresponding
risk scores (X,Y), P(Y > X) = 0.5 (or if you want
to handle ties, P(Y > X) + 0.5 × P(Y = X) = 0.5). Or,
equivalently,
that the population version of the ROC curve is a diagonal line
from the bottom left to the top right in the [0,1] square.
5. I said that the Wilcoxon–Mann–Whitney test measures the
concordance
probability P(Y > X). Now, P(Y > X) = P(Y - X > 0). If
this is 0.5,
then 0 is by definition the median of the distribution
of Y - X.
So the Wilcoxon–Mann–Whitney test is also a test of the null
hypothesis
that the median difference of risk scores between a
*random*
‘success’/‘failure’ pair is 0.
Note that it’s a test of the median of differences, not
a test of
differences in medians, as is unfortunately often said
(and taught)!
6. And if you have survival data without censoring, the
log-rank
test is also equivalent to this test.
(Note that the above methods only depend on the ranks of
the continuous
variable, not on the exact values.)
7. Regarding ties again: There are various variants of the
log-rank
test, the Wilcoxon–Mann–Whitney test etc., which differ in the
handling
of ties, in the use of approximations to the sampling
distribution of
the test statistic, and in the definition of the test
statistic
(and different software applications use different defaults).
But the tests and statistics are basically all the same
(the test statistics are just monotone transformations of each
other),
and there are variants where the tests give the exact same test
statistic,
the AUROC/C-statistic, and (naturally) the exact same P-values.
8. To sum up, for binary data, AUROC = Harrell’s C-index, and the
corresponding tests are equivalent to the Wilcoxon–Mann–Whitney
test.
For survival data, you can also calculate the C-index, even if
you have censored data. And for non-censored data, the log-rank
test is equivalent.
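Two of the details above can be illustrated with a small plain-Python sketch. The function names `harrell_c` and `auc_from_rank_sum` and all data are made up for the example. The code assumes that a higher risk score means a higher hazard, so a usable pair counts as concordant when the patient who died first has the higher score. It also assumes that pairs with tied observed times are simply skipped and that tied scores count as half a concordance. Real software may handle these cases differently.

```python
def harrell_c(times, died, risk):
    """Concordance over usable pairs; died[i] is 1 for death, 0 for censored.

    Returns None when no pair is usable.
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: i was seen to die strictly before j's observed
            # time, so j is known to have lived longer (censored or not).
            if died[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable if usable else None


def auc_from_rank_sum(pos, neg):
    """AUROC via the Mann-Whitney U statistic of the 'success' scores."""
    combined = pos + neg

    def midrank(v):
        # Average of the 1-based ranks that value v occupies in the pooled data.
        below = sum(c < v for c in combined)
        equal = sum(c == v for c in combined)
        return below + (equal + 1) / 2

    rank_sum = sum(midrank(v) for v in pos)
    u = rank_sum - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg))


# Patients A (time 3) and B (time 5): the pair is usable only if A died.
print(harrell_c([3, 5], [1, 0], [0.9, 0.1]))  # 1.0
print(harrell_c([3, 5], [0, 1], [0.9, 0.1]))  # None

# A slightly larger hypothetical survival data set.
times = [3, 5, 2, 8, 6]
died = [1, 0, 1, 1, 0]
risk = [0.4, 0.3, 0.9, 0.2, 0.5]
print(harrell_c(times, died, risk))

# Binary case: scores of 'successes' and 'failures'.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.4, 0.3, 0.2]
print(auc_from_rank_sum(pos, neg))  # 0.84375
```

In the survival example, 7 pairs are usable and 6 of them are concordant. In the binary example, U divided by the number of ‘success’/‘failure’ pairs is 13.5/16, which is the proportion of concordant pairs with ties counted as half.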
Karl Ove Hufthammer