AUROC and C-statistic

Abhaya Indrayan

Jan 9, 2022, 5:05:28 AM

to MedS...@googlegroups.com

Can somebody explain to me the difference between area under the ROC curve and the C-statistic? Are they the same?

Thanks.

~Abhaya

--

Dr Abhaya Indrayan,

Personal website: http://indrayan.weebly.com

Abhaya Indrayan

Jan 9, 2022, 5:25:30 AM

to MedS...@googlegroups.com

I mean for a binary outcome.

--
--
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules

---
You received this message because you are subscribed to the Google Groups "MedStats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medstats+u...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/medstats/CAP7G4a7yFODXO27ZzuNOPAOnb65HpOjF2jmYZdp3x%3DNP9WCSRw%40mail.gmail.com.

--

Dr Abhaya Indrayan, MSc,MS,PhD(OhioState),FSMS,FAMS,FRSS,FASc

Personal website: http://indrayan.weebly.com

Bruce Weaver

Jan 9, 2022, 10:47:25 AM

to MedStats

The following paragraph is quoted from Conroy (2012).

Perhaps as a result, the literature is now replete with statistics that are nothing
other than Mann–Whitney statistics under other names. Bross (1958), setting out the
calculation and use of ridit statistics, noted that the mean ridit score for a group was the
probability that an observation from that group would be higher than an observation
from a reference population. Harrell’s C statistic, which is a measure of the difference
between two survival distributions, is a special case of the Mann–Whitney statistic,
and indeed, in the absence of censored data, it reduces to the Mann–Whitney statistic
(Koziol and Jia 2009). Likewise, the tendency to refer to the Mann–Whitney statistic
as the area under the receiver operator characteristic curve is common in literature
evaluating diagnostic and screening tests in medicine, and is extremely unhelpful. The
name entirely obscures what the test actually tells us, which is the probability that a
person with the disorder or condition will score higher on the test than a person without
it. The area under the receiver operator characteristic curve has been proposed as a
measure of effect size in clinical trials (Brumback, Pepe, and Alonzo 2006), which would
extend the bafflement to a new population of readers.

Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test?. The Stata Journal, 12(2), 182-190.

https://journals.sagepub.com/doi/abs/10.1177/1536867X1201200202

HTH.

Abhaya Indrayan

Jan 9, 2022, 10:33:26 PM

to MedS...@googlegroups.com

Thanks, Bruce. I later added 'for a binary outcome', so I do not want to consider survival durations.

My question is limited to the interpretation of the two terms (AUROC and C-statistic), which seem to be used interchangeably. I can see that the AUROC is the probability that a person with the disorder or condition will score higher on the test than a person without it, but your quote from Conroy does not clarify what the C-statistic measures for binary outcomes. I may have missed the basics.

The quote says that the tendency to refer to the M-W statistic as the area under the ROC curve is extremely unhelpful, but it does not give reasons. It would be interesting to know why this parallel is drawn. I see a large number of papers these days that use the AUROC for inference on the utility of their models.

Regards.

~Abhaya


Prof Sada Nand Dwivedi

Jan 10, 2022, 2:04:29 AM

to meds...@googlegroups.com

Yes, we have used Harrell's C statistic in my PhD students' work. If I recall correctly, for a binary outcome the two are the same: both are the concordance probability.

Thanks and regards

SN Dwivedi

S.N. Dwivedi, Ph.D., FSMS, FRSS (UK)

Professor, Department of Biostatistics

All India Institute of Medical Sciences

Ansari Nagar

New Delhi-110029, India

Tel: 91-11-26588441 (Residence)

91-11-26593394 (Residence)

91-11-26593387 (Office)

91-9810571956

91-9868397937

Other Emails: dwiv...@hotmail.com

dwiv...@aiims.edu

dwi...@aiims.ac.in

dwiv...@yahoo.com


Bruce Weaver

Jan 10, 2022, 11:59:02 AM

to MedStats

I did see your second post emphasizing binary outcome, Abhaya. Please see the following example generated using Stata.

. // Example showing how to compute AUC using Stata's -roctab- command
. clear

. webuse hanley

. generate byte nodisease = !disease

. tabulate disease nodisease

      true |
   disease |
 status of |       nodisease
   subject |         0          1 |     Total
-----------+----------------------+----------
         0 |         0         58 |        58
         1 |        51          0 |        51
-----------+----------------------+----------
     Total |        51         58 |       109

. // Generate AUC value via -roctab-
. roctab disease rating

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         ------------------------------------------------------------
           109     0.8932       0.0307        0.83295     0.95339

. // Now use -ranksum- command with reverse-coded disease variable
. // to get the porder statistic, and notice that it equals AUC
. ranksum rating, by(nodisease) porder

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

   nodisease |      obs    rank sum    expected
-------------+---------------------------------
           0 |       51        3968        2805
           1 |       58        2027        3190
-------------+---------------------------------
    combined |      109        5995        5995

unadjusted variance    27115.00
adjustment for ties    -2116.86
                     ----------
adjusted variance      24998.14

Ho: rating(nodise~e==0) = rating(nodise~e==1)
             z =   7.356
    Prob > |z| =   0.0000
    Exact Prob =   0.0000

P{rating(nodise~e==0) > rating(nodise~e==1)} = 0.893

. // Now use Roger Newson's -somersd- command with original disease variable
. somersd disease rating, transf(c) tdist
Somers' D with variable: disease
Transformation: Harrell's c
Valid observations: 109
Degrees of freedom: 108

Symmetric 95% CI for Harrell's c
------------------------------------------------------------------------------
             |              Jackknife
     disease |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      rating |      0.893       0.031   28.942    0.0000       0.832      0.954
------------------------------------------------------------------------------

Here is the code (without intervening output) for anyone who wants it.

// Example showing how to compute AUC using Stata's -roctab- command
clear
webuse hanley
generate byte nodisease = !disease
tabulate disease nodisease
// Generate AUC value via -roctab-
roctab disease rating
// Now use -ranksum- command with reverse-coded disease variable
// to get the porder statistic, and notice that it equals AUC
ranksum rating, by(nodisease) porder
// Now use Roger Newson's -somersd- command with original disease variable
somersd disease rating, transf(c) tdist
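As a hand check on the output above, the AUC can be recovered from the rank sum that -ranksum- reports. This is only a sketch in Python rather than Stata, and the variable names are made up for illustration; the three input numbers are copied from the -ranksum- table. Because Stata assigns midranks to tied ratings, ties count as half a concordance here.

```python
# Figures copied from the -ranksum- output above.
n_diseased = 51            # observations with nodisease == 0
n_healthy = 58             # observations with nodisease == 1
rank_sum_diseased = 3968   # rank sum of rating in the diseased group

# Mann-Whitney U for the diseased group: its rank sum minus the
# smallest rank sum that a group of this size could possibly have.
u_diseased = rank_sum_diseased - n_diseased * (n_diseased + 1) / 2

# Dividing U by the number of diseased/healthy pairs gives the
# concordance probability, i.e. the AUC.
auc = u_diseased / (n_diseased * n_healthy)
print(round(auc, 4))  # 0.8932
```

The result matches the Area column from -roctab-, the porder value from -ranksum-, and Harrell's c from -somersd-.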

Cheers,

Bruce

Karl Ove Hufthammer

Jan 10, 2022, 2:25:56 PM

to meds...@googlegroups.com

Yes, the AUROC for a binary outcome is the same as the C-index =
Harrell’s C-index = the C-statistic = a (certain) concordance probability.

Here’s a quick explanation:

Setting: You have a binary outcome variable and a continuous variable,
e.g., a risk score (which you hope can predict the binary outcome).

Then you can do either of the following:

1: AUROC
For the binary ‘successes’, look at all possible values
of the continuous variable (also including ±∞).
Treat these values as thresholds, and for each threshold x,
calculate the proportion of ‘successes’ where the corresponding
continuous variable is ≥ x. This is the estimated sensitivity of
a test that predicts binary ‘success’ when the corresponding
risk score is ≥ x and ‘failure’ when it is < x.

Do the same for the binary ‘failures’ for each threshold x.
This is the estimated specificity of the test (for the
given threshold x).

Now you have the sensitivity and specificity for various thresholds.
Plot sensitivity on the y axis and 1 − specificity on the x axis.
This is the ROC curve (Receiver Operating Characteristic curve).
The area under the curve (from 0 to 1), AUC, is the AUROC
(should really be called the AUROCC …). If the risk score isn’t
useful for predicting the binary variable, the curve will approx.
lie on the diagonal from the lower left to the upper right,
so the AUROC will be 0.5. If it’s useful, it will lie above this
diagonal, so the AUROC will be > 0.5.

2: Concordance probability (C-index)
Look at all ‘success’/‘failure’ observation pairs in the data.
Calculate the proportions of concordant pairs, i.e., the pairs
where the ‘success’ observation has a higher value on the
continuous variable than the ‘failure’ observation (ignore ties for now).
This is the concordance probability (or C-index or C-statistic).
If the risk score isn’t useful for predicting the binary variable,
this value will be approx. 0.5. If it’s useful, it will be > 0.5.
And this value happens to be exactly the same as the AUROC above!
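The equality of the two recipes can be checked on a toy data set. The sketch below is in Python; the function names and the eight observations are invented for illustration, and ties are counted as half a concordance, which corresponds to the trapezoidal rule for the area.

```python
def roc_auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve, using the trapezoidal rule."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # 'successes'
    neg = [s for s, y in zip(scores, labels) if y == 0]  # 'failures'
    # Thresholds from high to low; +inf gives the (0, 0) corner and the
    # smallest observed score gives the (1, 1) corner.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for x in thresholds:
        sensitivity = sum(s >= x for s in pos) / len(pos)
        one_minus_specificity = sum(s >= x for s in neg) / len(neg)
        points.append((one_minus_specificity, sensitivity))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def concordance(scores, labels):
    """Proportion of concordant success/failure pairs; a tie counts 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


# Made-up risk scores and binary outcomes (1 = success), with some ties.
scores = [1, 2, 2, 3, 3, 4, 5, 5]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
print(roc_auc_trapezoid(scores, labels))  # 0.65625
print(concordance(scores, labels))        # 0.65625
```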

Additional details:

1. For logistic regression, i.e. multiple predictor variables, use the
   linear predictor as the risk score.

2. For Cox regression with some censored data (and for similar survival models)
   you can also use the linear predictor, but note that not all pairs provide
   useful data. For example, if death is defined as success and you observe
   time 3 for patient A and time 5 for patient B, there are four possible
   censored/non-censored combinations:

   A died at time 3 and B died at time 5:             B lived longer than A.
   A died at time 3 and B was censored at time 5:     B lived longer than A.
   A was censored at time 3 and B died at time 5:     No information on who lived the longest.
   A was censored at time 3 and B censored at time 5: No information on who lived the longest.

   For all useful pairs, calculate the proportion of concordant pairs,
   i.e., pairs where the person who lived the shorter time also had the higher risk score.
   Again, this is Harrell’s C, a measure of concordance.
   (If people talk about Harrell’s C, they usually talk about survival data.)

3. Regarding ties: If the continuous variable was really continuous,
   ties would have probability 0. In practice, there will often be ties.
   Treat them as ‘half a concordance’ when calculating the proportion of
   concordant pairs. This is equivalent to using linear interpolation
   for points corresponding to the observed thresholds in the ROC curve
   (i.e., using the trapezoidal rule for calculating the area under the curve).

4. The Wilcoxon–Mann–Whitney test is also equivalent! This is a test of
   the null hypothesis that the concordance probability is exactly 0.5,
   i.e., that for a random ‘success’/‘failure’ pair with corresponding
   risk scores (X,Y), P(Y > X) = 0.5 (or if you want
   to handle ties, P(Y > X) + 0.5 × P(Y = X) = 0.5). Or, equivalently,
   that the population version of the ROC curve is a diagonal line
   from the bottom left to the top right in the [0,1] square.

5. I said that the Wilcoxon–Mann–Whitney test measures the concordance
   probability P(Y > X). Now, P(Y > X) = P(Y - X > 0). If this is 0.5,
   then 0 is by definition the median of the distribution of Y - X.
   So the Wilcoxon–Mann–Whitney is also a test of the null hypothesis
   that the median difference of risk scores between a *random*
   ‘success’/‘failure’ pair is 0.
   Note that it’s a test of the median of differences, not a test of
   differences in medians, as is unfortunately often said (and taught)!

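6. And if you have survival data without censoring, the generalized
   Wilcoxon (Gehan–Breslow) variant of the log-rank test is also equivalent
   to this test (the standard, unweighted log-rank test is not).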
   (Note that above methods only depend on the ranks of the continuous
   variable, not on the exact values.)

7. Regarding ties again: There are various variants of the log-rank
   test, the Wilcoxon–Mann–Whitney test etc., which differ in the handling
   of ties, in the use of approximations to the sampling distribution of
   the test statistic, and in the definition of the test statistic
   (and different software applications use different defaults).
   But the tests and statistics are basically all the same
   (the test statistics are just monotone transformations of each other),
   and there are variants where the tests give the exact same test statistic,
   the AUROC/C-statistic, and (naturally) the exact same P-values.

8. To sum up, for binary data, AUROC = Harrell’s C-index, and the
   corresponding tests are equivalent to the Wilcoxon–Mann–Whitney test.
   For survival data, you can also calculate the C-index, even if
   you have censored data. And for non-censored data, the generalized
   Wilcoxon variant of the log-rank test is equivalent.

9. All this is easier to understand with some graphics. See this
post, where I explain in more detail how to manually calculate
the AUROC and the C-index: https://stats.stackexchange.com/a/146174
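The usable-pair rule from the survival-data item above (the one with patients A and B) can be sketched as follows. This is a simplified illustration in Python, not the algorithm of any particular package: the function name and data are made up, pairs with tied observed times are simply skipped, and the score is assumed to be oriented so that a higher value means a higher risk of death (flip the comparison if your score runs the other way).

```python
def harrell_c(times, died, risk):
    """Concordance over usable pairs for right-censored survival data.

    times: observed times; died: True if death was observed, False if
    censored; risk: risk scores (assumed: higher = earlier death).
    Returns None when no pair is usable.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tied times: skipped in this simplified sketch
            # Order the pair so that a has the shorter observed time.
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if not died[a]:
                continue  # shorter time is censored: unknown who lived longer
            usable += 1
            if risk[a] > risk[b]:
                concordant += 1.0
            elif risk[a] == risk[b]:
                concordant += 0.5  # tied scores count as half a concordance
    return concordant / usable if usable else None


# Made-up data: four patients, the second one censored at time 5.
times = [3, 5, 4, 6]
died = [True, False, True, True]
risk = [0.9, 0.4, 0.3, 0.2]
print(harrell_c(times, died, risk))  # 0.8

# The A/B example: A died at 3 and B was censored at 5, so the pair is usable.
print(harrell_c([3, 5], [True, False], [0.9, 0.1]))  # 1.0
# A censored at 3 and B died at 5: no usable pair.
print(harrell_c([3, 5], [False, True], [0.9, 0.1]))  # None
```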

Karl Ove Hufthammer


Abhaya Indrayan

Jan 10, 2022, 8:15:06 PM

to MedS...@googlegroups.com

Thanks to Bruce and Karl for their detailed and convincing explanations. They are indeed helpful and confirm that the AUROC and the C-statistic for a binary outcome are the same, although the method of calculation may differ.

~Abhaya

Bruce Weaver
Jan 10, 2022, 10:29 PM (7 hours ago)
to MedStats
