We don't currently have anything built in for ROC curves. I know that David
Matheson has worked up some syntax for this, and I understand that he'll
post this. I assume that by the C statistic you mean the one that gives the
area under the ROC curve by the trapezoidal method, not a version of the
Hosmer-Lemeshow goodness of fit test, which also goes by C in their book.
A Hosmer-Lemeshow statistic is available in the LOGISTIC REGRESSION procedure
in Release 7.5. The other C statistic is not currently available in SPSS.
In principle it's not difficult to compute, but in practice it's not
trivial. You need to compare the predicted probability for each 1 response
to that of every 0 response. The proportion of those pairs in which the 1
response has the higher predicted probability (counting each tie as one
half) is the C statistic.
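A minimal sketch of that pairwise comparison, written in Python purely for
illustration (it is not SPSS syntax, and the function name c_statistic and
the sample values are made up):

```python
def c_statistic(probs, responses):
    """Proportion of (1 response, 0 response) pairs in which the 1 response
    has the higher predicted probability; each tied pair counts one half."""
    events = [p for p, y in zip(probs, responses) if y == 1]
    nonevents = [p for p, y in zip(probs, responses) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one 1 response and one 0 response")
    score = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(nonevents))


# Invented predicted probabilities and 0/1 responses, for illustration only.
probs = [0.9, 0.8, 0.4, 0.4, 0.3, 0.1]
resp = [1, 1, 1, 0, 0, 0]
print(c_statistic(probs, resp))
```

The double loop is what makes this tedious in practice: the number of
comparisons is the product of the two group sizes.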
--
-----------------------------------------------------------------------------
David Nichols Senior Support Statistician SPSS, Inc.
Phone: (312) 329-3684 Internet: nic...@spss.com Fax: (312) 329-3668
-----------------------------------------------------------------------------
>----------
>From: Colleen Norris[SMTP:c...@tachy.uah.ualberta.ca]
>Sent: Tuesday, April 29, 1997 11:00 AM
>To: Multiple recipients of list SPSSX-L
>Subject: Logistic regression 'C' statistic
>
>Hello,
>How does one figure out the C statistic for a logistic regression model in
>SPSS? I am running SPSS 7.5. Also, any assistance graphing ROC curves would
>be much appreciated!
>Thanks
>Colleen
As David Nichols mentioned in a previous post, I have developed some syntax
for drawing ROC curves and calculating the area underneath them. This is
very much a work in progress. The code provided below draws the curve and
calculates the area, with results that closely match published results for
the example. I've applied similar code to other examples, some of which
involved thresholds on a diagnostic measurement rather than predicted
probabilities from logistic regression. Hopefully, this code will soon
evolve into a macro. Other extensions might involve plotting more than 1
curve and comparing the areas under curves from independent or paired
samples. I'm certainly open to suggestions or corrections from the list
members.
I have provided some SPSS syntax below to draw an ROC curve and calculate
the area underneath the curve. This example starts by reading the data and
restructuring it into a form that SPSS Logistic Regression can work with.
The example is from the SAS manual, "Logistic Regression Examples Using the
SAS System", which is cited more fully below. I used this data so that I
could compare my calculation of the area with the c statistic reported by
SAS (which was .831). The response variable was the event of perinatal
mortality, with 3 binary covariates: mother's smoking rate (more than
5/day), age (30+), and gestation period (261+ days). Logistic regression
was run and the predicted probabilities were saved as prob. In the SAS
example, the cutpoints for the ROC appear to be set at the predicted
probabilities (from PROC LOGISTIC) for the 8 covariate patterns, which
range from .006 to .2854 and are quite irregularly spaced. To preserve
comparability and still allow a simple and efficient looping structure,
I autorecoded the predicted probabilities from Logistic Regression into a
variable called probcut, so that the cutpoints are the integers 1 to 8.
For each weighted case (which comprises a particular response, covariate
pattern, and caseweight), a new case was created for each cutpoint, with
the case's status as a true positive, false positive, true negative, or
false negative stored in the 2 variables HIT and FPOS. When the resulting
file is aggregated by cutpoint, the sensitivity and false positive
proportions are available for each cutpoint.
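The cutpoint expansion and aggregation described above can be sketched
outside SPSS as follows. This is an illustrative Python version under
stated assumptions, not a translation of the syntax below: roc_points is a
hypothetical helper, each row is (predicted probability, response,
caseweight), an event is called when the probability is at or above the
cutpoint, and the numbers are invented rather than taken from the perinatal
data.

```python
def roc_points(cases):
    """Return (cut, false positive proportion, sensitivity) for each
    distinct predicted probability, using caseweights as counts."""
    cases = list(cases)
    n_event = sum(w for _, y, w in cases if y == 1)
    n_nonevent = sum(w for _, y, w in cases if y != 1)
    points = []
    for cut in sorted({p for p, _, _ in cases}):
        # Weighted counts of events and nonevents called positive at this cut.
        hits = sum(w for p, y, w in cases if y == 1 and p >= cut)
        fpos = sum(w for p, y, w in cases if y != 1 and p >= cut)
        points.append((cut, fpos / n_nonevent, hits / n_event))
    return points


# Invented data: three covariate patterns, each split into event (1) and
# nonevent (0) rows with a caseweight, mirroring the weighted file layout.
cases = [
    (0.1, 1, 1), (0.1, 0, 9),
    (0.3, 1, 3), (0.3, 0, 7),
    (0.6, 1, 6), (0.6, 0, 4),
]
for cut, pfpos, phit in roc_points(cases):
    print(cut, pfpos, phit)
```

The lowest cutpoint calls every case an event, so it always lands at
(1, 1), just as cut = 1 does in the looping scheme described above.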
I used the trapezoidal integration method directly to calculate the area
under the ROC. This involved calculating the area in each cutpoint 'case'
in the aggregate file and using the lag function to increment the
cumulative area from the previous case by this amount. The result for this
example was .8299, compared to SAS' c statistic value of .831.
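As a rough illustration of the trapezoidal step, here is a short Python
sketch. It is an assumption-laden stand-in, not the SPSS code: the helper
name trapezoid_area and the three points are invented, and the sketch adds
the (0, 0) and (1, 1) corners itself before summing.

```python
def trapezoid_area(points):
    """Area under an ROC given (false positive proportion, sensitivity)
    pairs, by summing trapezoids between consecutive points."""
    # Add both corners so the curve spans the whole horizontal axis.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Width times the average of the two heights.
        area += (x1 - x0) * (y1 + y0) / 2
    return area


# Invented ROC points, for illustration only.
points = [(1.0, 1.0), (0.55, 0.9), (0.2, 0.6)]
print(trapezoid_area(points))
```

Each term in the loop plays the role of one lagged increment to the
cumulative area.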
In an example such as this, where every observed probability falls at a
cutpoint, one could use a similar aggregate file to calculate c as defined
by SAS. The W statistic in Hanley & McNeil (1982)* corresponds to c, if I
interpret it correctly, and the aggregate command in the code below could
be modified to calculate case counts of true positives, etc., to allow the
calculations in Table II of Hanley & McNeil. The advantage of this approach
may be the ability to then apply their calculations for the standard error
of the area.
*(Hanley, James A. & McNeil, Barbara J. (1982). The meaning and use of the
area under a receiver operating characteristic (ROC) curve. Radiology,
143(1), 29-36.)
For comparison, I added another method for calculating the area under the
curve, which was provided on this list by C.R.Gould (if I'm reading the
right part of the mail header).
ROC area = 1 - U/[n(group1) * n(group2)]
where U is the Mann-Whitney U statistic, and n(group1) and n(group2) are
the sample sizes for the event and nonevent groups (deaths and nondeaths in
this example). All 3 components are provided by running NPAR TESTS in SPSS.
I included the NPAR TESTS run just after Logistic Regression in the code
below. The predicted probabilities are the dependent variable for this
procedure. I used a calculator to apply the NPAR TESTS output to the area
formula, with a result of .8299, which is identical to the trapezoidal
integration result and very close to the c statistic from the SAS results.
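The relation between U and the area can be checked on a small example with
the Python sketch below. It is only an illustration: the function name
mann_whitney_area and the data are invented, responses are assumed to be
coded 0/1, and it assumes that the U plugged into the formula is the one
that counts pairs in which the nonevent outranks the event (the smaller of
the two possible U values whenever the area exceeds .5).

```python
def mann_whitney_area(probs, responses):
    """Return (U, area) where area = 1 - U / (n0 * n1) and U counts the
    pairs in which a nonevent outranks an event, ties counting one half."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    ranks = [0.0] * len(probs)
    i = 0
    while i < len(order):
        # Find the run of tied values and give each its average rank.
        j = i
        while j + 1 < len(order) and probs[order[j + 1]] == probs[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n1 = sum(1 for y in responses if y == 1)
    n0 = len(responses) - n1
    rank_sum_events = sum(r for r, y in zip(ranks, responses) if y == 1)
    u_events = rank_sum_events - n1 * (n1 + 1) / 2
    u = n0 * n1 - u_events
    return u, 1 - u / (n0 * n1)


# Invented predicted probabilities and 0/1 responses, for illustration only.
probs = [0.9, 0.8, 0.4, 0.4, 0.3, 0.1]
resp = [1, 1, 1, 0, 0, 0]
print(mann_whitney_area(probs, resp))
```

If the program you use reports the other U, the area is simply
U/[n(group1) * n(group2)] instead.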
David Matheson
SPSS Technical Support
********************************************************************
* SPSS program to run logistic regression and draw a nonparametric
* ROC curve for Wermuth's perinatal mortality data, as presented in
* Example 10 in SAS Institute Inc. (1995), 'Logistic Regression
* Examples Using the SAS System, Version 6, First Edition', Cary NC:
* SAS Institute Inc. (pp. 87-92),
* and Exercise 5.17 in McCullagh & Nelder (1989), 'Generalized
* Linear Models (2nd)', Chapman & Hall (p190)
.
title 'Perinatal Mortality Data Logistic Regression and ROC'.
data list free / deaths tbirths cigs age gestpd.
begin data.
50 365 1 1 1
9 49 2 1 1
41 188 1 2 1
4 15 2 2 1
24 4036 1 1 2
6 465 2 1 2
14 1508 1 2 2
1 125 2 2 2
end data.
* restructure the data so that deaths (resp = 1) and nondeaths (resp=0)
are represented by separate cases for each covariate pattern.
compute resp = 1 .
compute wt = deaths.
save outfile = 'temp1.sav' /keep = resp wt cigs age gestpd.
compute resp = 0 .
compute wt = tbirths - deaths.
save outfile = 'temp2.sav' /keep = resp wt cigs age gestpd.
add files /file = 'temp1.sav' /file = 'temp2.sav'.
execute.
weight by wt.
LOGISTIC REGRESSION VAR=resp
/METHOD=ENTER cigs age gestpd
/SAVE PRED (prob)
/CRITERIA PIN(.05) POUT(.10) ITERATE(20).
* I include the Mann-Whitney test for comparison here.
* calculate area as 1 - (U/(N_0*N_1)), where U is the Mann-Whitney U,
* N_0 and N_1 are the sample sizes for 0s and 1s on the logistic
* regression response variable .
NPAR TESTS
/M-W= prob BY resp(0 1)
/MISSING ANALYSIS.
* autorecode to create integer cut point for each predicted probability
  value.
* if you have a single predictor , you could loop through those values
  to set cut.
autorecode variables = prob / into probcut / print.
loop #i = 1 to 8.
compute cut = #i.
* creating probval allows you to see prob(event) associated with a cut .
if (probcut = cut) probval = prob.
if (probcut ne cut) probval = $sysmis.
* call = 1 if event predicted at this cut; 0, otherwise .
compute call = (probcut ge cut) .
if (resp = 1) hit = call .
if (resp ne 1) fpos = call .
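xsave outfile = 'roctemp.sav' /keep = cut probval hit fpos wt.
end loop.
execute.
* wt is kept in roctemp.sav and the weight is reapplied after GET, so the
  aggregated means stay weighted even if the weighting status is not
  carried over with the file.
get file = 'roctemp.sav' .
weight by wt.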
aggregate outfile = *
/break = cut /probval = first(probval)
/phit pfpos = mean(hit fpos).
sort cases by pfpos phit (a) cut (d).
*calculate and print area under the curve(s).
do if (missing(lag(cut))) .
+ compute cumarea = 0.
else.
+ compute cumarea = lag(cumarea) +
      (pfpos - lag(pfpos)) * (phit + lag(phit))/2 .
end if.
formats pfpos phit cumarea (f8.6).
variable labels phit 'Sensitivity' pfpos '1 - Specificity'.
* listing of prob(event), 1- Specificity (false positive) ,
  Sensitivity, and cumulative area for each cutpoint .
list cut probval pfpos phit cumarea.
Report
 /FORMAT= CHWRAP(ON) PREVIEW(OFF) CHALIGN(BOTTOM) UNDERSCORE(ON)
  ONEBREAKCOLUMN(OFF) CHDSPACE(1) SUMSPACE(0) AUTOMATIC NOLIST
  BRKSPACE(0) PAGE(1) MISSING'.' LENGTH(1, 59) ALIGN(LEFT) TSPACE(1)
  FTSPACE(1) MARGINS(1,11)
 /VARIABLES cumarea (VALUES) (RIGHT) (OFFSET(0))(8)
 /BREAK (TOTAL) 'Area Under ' (SKIP(1))
 /SUMMARY MAX( cumarea) SKIP(1) 'Curve' .
* draw the ROC graph .
* compute d1 and d2 at (0,0) and (1,1) to draw diagonal on ROC .
* there will be 1 point at (1,1) and others at (0,0)
* as long as the logical test involves one observed cut point.
compute d1 = (cut = 1).
compute d2 = (cut = 1).
GRAPH
/SCATTERPLOT(OVERLAY)=pfpos d2 WITH phit d1 (PAIR) BY cut (IDENTIFY)
/MISSING=VARIABLEWISE
/TITLE= 'ROC for Morbidity Log. Reg. Example'.
* From here, you have to edit the chart to scale and label axes,
  draw line thru ROC points and on diagonal, and hide the legend.
* You can use the Point identification icon in the chart editor
  to label those optimum cut points.
* if you don't want the diagonal line in the ROC graph,
  make a simple scatterplot as below and the variable labels will
  appear on the axes.
* You will still need to edit the chart to scale axes and connect the
  ROC points.
GRAPH
/SCATTERPLOT = pfpos WITH phit BY cut (IDENTIFY)
/MISSING=VARIABLEWISE
/TITLE= 'ROC for Morbidity Log. Reg. Example'.