I am working with release 5.0.2 and when running Kaplan-Meier I try
to get a Hazard plot. In the help system it is described as the
probability per unit time to experience the terminal event at time t
if the individual has survived until time t. What I then get is a plot
with the y-axis title "cumulative hazard".
My problem is that I don't know what the cumulative hazard
represents. Since my plot ends with values above 1.5 at time 72,
I assume that it does not mean "the probability that an individual
will experience the event between time 0 and time t". Could
someone tell me how this cumulative hazard is calculated (it is
not a sum of the hazards from time 0 to time t, I suppose) and
what it's used for?
Thanks in advance.
-m-a-s-
The best definition of a hazard rate I've seen comes from David
Kleinbaum's _Survival Analysis: A Self-Learning Text_ (Springer-Verlag,
1996) p. 10: "The hazard function _h(t)_ gives the instantaneous
potential per unit time for the event to occur, given that the individual
has survived up to time _t_." He also refers to it as a conditional
failure rate.
The cumulative hazard plotted in the K-M procedure is mathematically
defined as the negative of the natural logarithm of the survival
function. Thus, when the survival function goes below 1/e (e being
the base of the natural logarithm, about 2.718), the cumulative hazard
will be above 1.
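
To make the relationship concrete, here is a minimal sketch in Python. The durations and censoring flags are invented for illustration, and the function names are my own; this is not a description of how SPSS computes its plot internally, only of the definition H(t) = -ln S(t) applied to a hand-rolled Kaplan-Meier estimate.

```python
import math

# Hypothetical durations and event flags (1 = event observed, 0 = censored).
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 1, 0, 0]

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    curve = []
    surv = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

def cumulative_hazard(curve):
    """H(t) = -ln S(t); reported as infinity if the survival estimate hits 0."""
    return [(t, -math.log(s) if s > 0 else math.inf) for t, s in curve]

km = kaplan_meier(times, events)
for (t, s), (_, h) in zip(km, cumulative_hazard(km)):
    print(f"t={t:>2}  S(t)={s:.3f}  H(t)={h:.3f}")
```

In this made-up sample the survival estimate at t=8 is 0.300, which is below 1/e (about 0.368), so the cumulative hazard there is about 1.204. By the same relation, a cumulative hazard of 1.5 corresponds to a survival estimate of exp(-1.5), roughly 0.22.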
Clear explanations of these concepts are not exactly a staple of the
survival analysis literature, particularly with regard to conceptual
interpretation in any way abstracted from the mathematical definitions.
A number of books mislead people into thinking that a hazard rate is
a probability.
--
-----------------------------------------------------------------------------
David Nichols Senior Support Statistician SPSS, Inc.
Phone: (312) 329-3684 Internet: nic...@spss.com Fax: (312) 329-3668
-----------------------------------------------------------------------------
[...]
>>with the y-axis title "cumulative hazard".
>>
>>My problem is that I don't know what the cumulative hazard does
>>represent. Since my plot ends with values above 1.5 at time 72
[...]
>>someone tell me how this cumulative hazard is calculated (it is
>>not a sum of the hazards from time 0 to time t, I suppose) and
>>what it's used for?
>
>The best definition of a hazard rate I've seen comes from David
>Kleinbaum's _Survival Analysis: A Self-Learning Text_ (Springer-Verlag,
>1996) p. 10: "The hazard function _h(t)_ gives the instantaneous
>potential per unit time for the event to occur, given that the individual
>has survived up to time _t_." He also refers to it as a conditional
>failure rate.
>
>The cumulative hazard plotted in the K-M procedure is mathematically
>defined as the negative of the natural logarithm of the survival
>function. Thus, when the survival function goes below 1/e (e being
>the natural exponential constant), the cumulative hazard will be
>above 1.
This is fine for answering the question of what a cumulative hazard
is, but it doesn't answer the question of what it's used for. I quote
from Kiefer, "Economic Duration Data and Hazard Functions," _The Journal
of Economic Literature_, June 1988, p. 652 (note that Kiefer calls it
the "integrated hazard" rather than the "cumulative hazard"):
    The "integrated hazard" [integral from 0 to t of the hazard
    function] is also a useful function in practice. It is the
    basic ingredient in a variety of specification checks. The
    integrated hazard does not have a convenient interpretation
    however. In particular, note that it is not a probability.
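
The bracketed definition (integral of the hazard from 0 to t) and the "negative log of the survival function" definition describe the same quantity in continuous time. A rough numerical check, assuming a Weibull distribution whose shape and scale values are picked purely for illustration:

```python
import math

# Assumed Weibull example: shape k and scale lam are illustrative values only.
k, lam = 1.5, 40.0

def hazard(t):
    """Weibull hazard h(t) = (k / lam) * (t / lam)**(k - 1)."""
    return (k / lam) * (t / lam) ** (k - 1)

def survival(t):
    """Weibull survival function S(t) = exp(-(t / lam)**k)."""
    return math.exp(-((t / lam) ** k))

def integrated_hazard(t, steps=10_000):
    """Midpoint-rule approximation to the integral of hazard() from 0 to t."""
    dt = t / steps
    return sum(hazard((i + 0.5) * dt) for i in range(steps)) * dt

t = 72.0
H_numeric = integrated_hazard(t)
H_from_survival = -math.log(survival(t))
print(f"integral of h from 0 to {t}: {H_numeric:.4f}")
print(f"-ln S({t}):                  {H_from_survival:.4f}")
```

Both numbers come out near 2.415, well above 1, which is consistent with Kiefer's remark that the integrated hazard is not a probability.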
This doesn't give a full answer to the question either, but it points
in the right direction: as far as I can tell, the cumulative hazard
function is of little interest in itself or for characterizing the data.
However, graphs of the integrated hazard are used for specification
tests, i.e. tests to see if one's functional forms and other
assumptions are appropriate. For example (and I may be garbling the
statistical theory and results here, so forgive me if I get this part
wrong), from the parameter estimates and the actual data one calculates
something called generalized residuals, which for regression models of
duration data are the fitted integrated hazard evaluated at each
observed duration. I believe that if one's model is correctly
specified, these residuals should behave like draws from a unit
exponential distribution, so a graph of their estimated integrated
hazard against the residuals themselves should be a straight 45 degree
line from the origin. One graphs them to see how closely they adhere to
that line. If your generalized residuals deviate "too far" from it
(unfortunately I don't know of an exact definition of how far is "too
far"), then your specification is probably incorrect: e.g. you may be
lacking an important explanatory variable, creating heterogeneity in
your data, or your functional form may be incorrect.
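
A rough sketch of this kind of check in Python, using simulated data only. The setup is an assumption of mine: there are no covariates and no censoring, an exponential model is fitted by maximum likelihood, and the generalized residuals (often called Cox-Snell residuals) are the fitted integrated hazard at each duration. Their cumulative hazard is estimated with a simple Nelson-Aalen sum and compared with the 45 degree line. All names and parameter values are illustrative.

```python
import random

random.seed(0)
n = 1000

# Simulated durations: one sample the exponential model fits, one it does not.
exp_data = [random.expovariate(0.05) for _ in range(n)]
weibull_data = [random.weibullvariate(20.0, 2.5) for _ in range(n)]  # scale, shape

def generalized_residuals(durations):
    """Fit an exponential model by MLE and return sorted residuals rate * t."""
    rate = len(durations) / sum(durations)
    return sorted(rate * t for t in durations)

def nelson_aalen(sorted_values):
    """Cumulative hazard estimate at each sorted value (no censoring, no ties)."""
    n_obs = len(sorted_values)
    total, estimate = 0.0, []
    for i in range(n_obs):
        total += 1.0 / (n_obs - i)  # one event among the n_obs - i still at risk
        estimate.append(total)
    return estimate

def max_gap(durations, quantile=0.9):
    """Largest distance from the 45 degree line over the lower 90% of residuals."""
    r = generalized_residuals(durations)
    H = nelson_aalen(r)
    cut = int(quantile * len(r))  # the upper tail is too noisy to judge
    return max(abs(h - x) for h, x in zip(H[:cut], r[:cut]))

gap_good = max_gap(exp_data)
gap_bad = max_gap(weibull_data)
print("exponential data, exponential model:", round(gap_good, 3))
print("Weibull data, exponential model:    ", round(gap_bad, 3))
```

The second gap should come out clearly larger than the first, which is the numerical counterpart of a residual plot bending away from the 45 degree line. What counts as "too far" is still a judgment call here; the sketch only compares the two cases.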
I think survival analysis is a technique of growing importance in
the social sciences (indeed I just wrote an informal paper about it);
SPSS is gradually beefing up its survival analysis but it still seems to
be several years behind the times. I don't have ver 7.5 but I'm glad
to hear that it seems to do plots of the cumulative hazard -- the
literature has been talking about generalized residuals and
specification checking for a good 15 years now, but I have not yet seen
a statistical program which will do these plots automatically. Even
Limdep, which has the most advanced survival analysis programs of the
software packages I've seen, will only do survival and hazard plots, not
cumulative hazard plots.
>Clear explanations of these concepts are not exactly a staple of the
>survival analysis literature, particularly with regard to conceptual
Yup. However, I think that as survival analysis grows in popularity we
will see more and more articles which are written clearly and in an
easier-to-understand fashion. Some psychologists (Willett and Singer)
have already started to do so, and an institutional researcher (Ronco)
recently published an "intro to survival analysis" article.
>interpretation in any way abstracted from the mathematical definitions.
>A number of books mislead people into thinking that a hazard rate is
>a probability.
True enough, although that's sort of like saying that a density is
not a probability. That is quite true, and densities can have values
greater than 1, but personally I find that my best intuitive
notion of what a density is, is that it's sort of like a probability.
And a hazard function is sort of like a conditional probability. If one
has discrete variables (and for that matter, data are always discrete),
it IS a probability.
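
For the discrete case, a small sketch with a made-up life table (all counts are hypothetical). Each interval's hazard is the conditional probability of the event in that interval given survival to its start, so it stays between 0 and 1, while the running sum of those hazards does not have to.

```python
# Hypothetical life table: (interval, number at risk at start, events in interval).
life_table = [(1, 100, 10), (2, 90, 18), (3, 72, 18), (4, 54, 27)]

hazards = []
survival = []
surv = 1.0
for interval, at_risk, n_events in life_table:
    h = n_events / at_risk   # P(event in interval | survived to its start)
    surv *= 1.0 - h          # probability of surviving past this interval
    hazards.append(h)
    survival.append(surv)
    print(f"interval {interval}: hazard={h:.2f}  survival={surv:.2f}")

print("sum of hazards:", round(sum(hazards), 2))
```

Here the hazards are 0.10, 0.20, 0.25 and 0.50, each a legitimate probability, but their sum is 1.05, so even in discrete time the cumulated hazard is not itself a probability.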
--Mike Tamada
Occidental College
tam...@oxy.edu
SPSS has been plotting cumulative hazard estimates for survival models
in the K-M and COX REG procedures since release 5.0.2, about five years
ago.
While the data we use in practice may be measured to finite precision
and for that reason not be truly continuous, many of the models we use
assume that the underlying data are continuous. For that reason, keeping the
distinctions between things like densities and probabilities clear is
very important. I certainly wouldn't want to try to explain to a class
why a hazard rate can be greater than 1 if I've told them they can
think of it as a probability.
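
One way to illustrate the point, assuming a constant hazard whose value is chosen only for illustration: the hazard rate changes with the unit of time and can exceed 1, while the probability of the event over a fixed period does not change with the unit and stays below 1.

```python
import math

rate_per_year = 2.0                     # assumed constant hazard, per year
rate_per_month = rate_per_year / 12.0   # same process, time measured in months

def prob_event_within(rate, duration):
    """P(event by `duration`) under a constant hazard: 1 - exp(-rate * duration)."""
    return 1.0 - math.exp(-rate * duration)

p_year = prob_event_within(rate_per_year, 1.0)      # one year, in years
p_months = prob_event_within(rate_per_month, 12.0)  # the same year, in months

print(f"hazard per year:  {rate_per_year:.3f}")
print(f"hazard per month: {rate_per_month:.3f}")
print(f"P(event within one year): {p_year:.4f} (years), {p_months:.4f} (months)")
```

The hazard is 2 per year or about 0.167 per month, yet the probability of the event within the year is about 0.865 either way.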