Gta Censored

Arnaude Kubiak

Aug 3, 2024, 5:43:26 PM

to behamanec

Within the general framework of chemical risk assessment, a difficult step in dietary exposure assessment is the handling of concentration data reported to be below the limit of detection (LOD). These data are known as non-detects and the resulting distribution of occurrence values is left-censored. Handling left-censored data represents a challenge for EFSA's collection and statistical analysis of chemical occurrence data. EFSA has so far treated left-censored data with widely used substitution methods recommended by international organisations. The appropriateness of this approach has a natural limitation in the computation of percentiles and in the application of statistical techniques. An EFSA working group was established to estimate the accuracy of methods currently used and to propose recommendations for more advanced alternative statistical approaches. Based on a simulation study and on analyses of real data, an ad hoc evaluation was carried out to assess the performance of different statistical methods to handle non-detects, i.e. parametric maximum likelihood (ML) models, the log-probit regression method and the non-parametric Kaplan-Meier (KM) method. Results showed that the number of samples had a relatively limited impact on the accuracy and precision of estimates, but the degree of censoring had a large effect. When analysing a complex set of data, it was also shown that it is essential to identify possible sources of heterogeneity in a dataset, such as country of sample collection/origin, food group, laboratory, etc. Statistical analyses should either be conducted separately for each level of these factors, or, to explicitly account for this heterogeneity, fixed/random effect ML models could be used.
Taking into account the minimum number of available samples and different censoring percentages, the working group outlined recommendations, including the use of appropriate statistical tests, to handle left-censored distributions of chemical contaminant data in the context of exposure assessment.
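To make the substitution approach concrete, here is a minimal sketch that replaces each non-detect with 0, LOD/2 or LOD to get lower-, middle- and upper-bound means. The function name, the sample values and the single shared LOD are assumptions for illustration only; they are not taken from the EFSA report.

```python
from statistics import mean

def substitution_means(values, lod):
    """Return (lower, middle, upper) bound means for left-censored data.

    `values` holds measured concentrations, with None marking a non-detect
    (a result reported as below the limit of detection `lod`).
    """
    def substituted(fill):
        return [fill if v is None else v for v in values]

    return (
        mean(substituted(0.0)),        # lower bound: non-detects set to 0
        mean(substituted(lod / 2.0)),  # middle bound: non-detects set to LOD/2
        mean(substituted(lod)),        # upper bound: non-detects set to LOD
    )

# Hypothetical sample: 3 of 5 results are non-detects (60% censoring).
sample = [None, 2.0, None, 4.0, None]
lower, middle, upper = substitution_means(sample, lod=1.0)
print(lower, middle, upper)
```

With a fixed LOD, the gap between the lower and upper bound grows with the share of non-detects, which gives a quick feel for how much the estimate depends on the substitution rule.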

Censoring of survival data also has an important influence on research results. Too high a censoring rate lowers the accuracy and effectiveness of a model's results and increases the risk of bias. Hence, censoring rates should be reported in articles. The results show that no articles reported the censoring rate, although many articles had excessive censoring. For example, the calculation shows that the study by Xuexia et al [32] has a censoring rate of up to 84%, severely influencing the results.
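For reference, the censoring rate discussed here is just the share of subjects whose event was never observed. A small sketch, with a made-up cohort chosen to reproduce the 84% figure (the function name and data are hypothetical):

```python
def censoring_rate(event_observed):
    """Fraction of subjects whose event was not observed (i.e. censored)."""
    if not event_observed:
        raise ValueError("no observations")
    censored = sum(1 for observed in event_observed if not observed)
    return censored / len(event_observed)

# Hypothetical cohort: 16 observed events among 100 subjects.
events = [True] * 16 + [False] * 84
print(f"{censoring_rate(events):.0%}")  # 84%
```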

the paper is here: Reporting and methodological quality of survival analysis in articles published in Chinese oncology journals. But their reference 32 is impossible to track down. I am dealing with datasets with massive N and thus the censoring rate is high. My impulse is to use logistic regression but the convention within the literature is to apply Cox regression. Any thoughts, opinions or papers you can point to eg using simulations to evaluate how methods perform? cheers

really appreciate your comments. will respond more fully when i have thought more about it and looked into it further. i stumbled upon this simulation study relating to the non-PH situation: Bias Of The Cox Model Hazard Ratio:

Good point. I think this is a bit more of a goodness of fit issue than a pure censoring issue. This is a different issue, but I think in general we need to move to capturing uncertainty in proportional hazards, proportional odds, normality, etc. using full Bayesian models.

CDC WONDER is a system developed to promote information-driven decision making and provide access to detailed public health information to the general public. Although CDC WONDER contains a wealth of data, any counts fewer than 10 are suppressed for confidentiality reasons, resulting in left-censored data. The objective of this analysis was to describe methods for the analysis of highly censored data.
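One way to see what suppression does to a model is through the likelihood: a reported count contributes its usual probability, while a suppressed cell only tells us the count was somewhere from 0 to 9. The sketch below assumes a plain Poisson model for each cell; the function names and the Poisson assumption are mine, not something stated in the abstract.

```python
import math

SUPPRESSION_THRESHOLD = 10  # counts below this value are suppressed

def poisson_logpmf(y, mu):
    """Log of the Poisson(mu) probability of observing count y (mu > 0)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def log_likelihood(count, mu):
    """Log-likelihood of one cell under a Poisson(mu) model.

    `count` is the reported count, or None when the cell was suppressed.
    """
    if count is not None:
        return poisson_logpmf(count, mu)
    # Suppressed cell: all we know is 0 <= count < threshold, so the
    # contribution is log P(Y < threshold), computed with log-sum-exp
    # to stay finite even when mu is large.
    logs = [poisson_logpmf(y, mu) for y in range(SUPPRESSION_THRESHOLD)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

print(log_likelihood(12, mu=8.0))    # observed cell
print(log_likelihood(None, mu=8.0))  # suppressed cell
```

Substituting a fixed value for the suppressed cell would instead treat it as an exact observation, which is why that route gives no measure of uncertainty.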

Although the substitution and nonspatial approach provided age-standardized rate estimates that were more highly correlated with the true rate estimates, the estimates from the spatial Bayesian model provided a superior compromise between goodness-of-fit and model complexity, as measured by the deviance information criterion. In addition, the spatial Bayesian model provided rate estimates with greater precision than the nonspatial approach; in contrast, the substitution approach did not provide estimates of uncertainty.

Because of the ability to account for multiple sources of dependence and the flexibility to include covariate information, the use of spatial Bayesian models should be considered when analyzing highly censored data from CDC WONDER.

When the goal is to assess geographic disparities in age-standardized rates between regions, overcoming the privacy protections to obtain trustworthy estimates of the age-specific rates and their levels of uncertainty is only half the battle. For instance, Fay (11) followed the work of Fay and Feuer (9) to construct interval estimates for ratios based on F distributions. Tiwari et al (10) modified this work to yield more efficient interval estimation for rates and ratios of rates from nonnested regions, work that was later extended by Tiwari et al (12) for when one subregion is nested within a larger region (eg, a county nested within a state); Zhu et al (13) extended these approaches to more accurately account for spatial autocorrelation. When the age-standardized rates must be estimated from suppressed data, further modifications must be made or these approaches will fail to adequately account for all sources of uncertainty, yielding interval estimates that may be too narrow (14,15).
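For readers less familiar with the quantity being compared, a directly age-standardized rate is a weighted sum of age-specific rates. The sketch below computes the point estimate and a simple Poisson-based standard error; it does not reproduce the interval methods of Fay and Feuer or Tiwari et al, and the function name, weights and counts are invented for the example.

```python
import math

def age_standardized_rate(deaths, population, std_weights, per=100_000):
    """Directly age-standardized rate and a Poisson-based standard error.

    Rate = sum_a w_a * d_a / n_a, with weights normalised to sum to 1.
    Variance assumes d_a ~ Poisson, giving sum_a (w_a / n_a)^2 * d_a.
    """
    if not (len(deaths) == len(population) == len(std_weights)):
        raise ValueError("inputs must have one entry per age group")
    total_w = sum(std_weights)
    rate = 0.0
    variance = 0.0
    for d, n, w in zip(deaths, population, std_weights):
        w_norm = w / total_w
        rate += w_norm * d / n
        variance += (w_norm / n) ** 2 * d
    return rate * per, math.sqrt(variance) * per

# Hypothetical county with three age groups.
rate, se = age_standardized_rate(
    deaths=[5, 20, 50],
    population=[10_000, 8_000, 5_000],
    std_weights=[0.5, 0.3, 0.2],
)
print(round(rate, 1), round(se, 1))
```

If any of the `deaths` entries were suppressed, both the rate and this standard error would need to reflect that extra uncertainty, which is the point the paragraph above is making.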

The objective of this analysis was to illustrate 2 Bayesian approaches for estimating county-level mortality rates, by using heart disease mortality data from 1980 obtained from CDC WONDER (18), and to compare these results with those generated by the approach of Tiwari et al (8). In particular, we used a simple, nonspatial Bayesian model, which produces estimates similar to those from Tiwari et al (8), along with a more complex Bayesian model that accounts for spatial and between-age sources of dependence.

Although the prior specification in Equation 4 is a convenient choice, it does not take full advantage of the possibilities of Bayesian modeling. In particular, Equation 4 does not account for spatial relationships or the relationships between different age groups. To allow for such structures to be included in the model, we considered Poisson regression models, where

Although the CAR model is a powerful tool for analyzing spatial data, it does not account for possible correlation between the multiple age groups. To account for this, we instead considered a multivariate extension of the CAR model: the multivariate CAR (MCAR) model of Gelfand and Vounatsou (23). As with the CAR model, the MCAR shrinks estimates toward their neighboring values; unlike the CAR model, however, the MCAR explicitly models the between-group correlation in the data and leverages these correlations to produce more precise age-specific rate estimates. MCAR models were used recently to model spatially referenced survival times in cancer data (24), temporal trends in county-level asthma hospitalization rates (25), temporal trends in heart disease mortality by race and sex (26), and temporal trends in age-specific stroke mortality (27), among many other applications. Full details, including a discussion of the prior distributions used, are provided in the Web Appendix.
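The "shrinks toward neighboring values" behaviour is easiest to see in the full conditional of an intrinsic CAR prior with 0/1 adjacency weights: each region's random effect is centred on the average of its neighbours, with variance that falls as the number of neighbours grows. This is a sketch of that textbook form only; the exact specification used in the paper (and the multivariate MCAR version) is in its Web Appendix and is not reproduced here. All names and values below are hypothetical.

```python
def car_conditional(i, phi, neighbors, sigma2):
    """Conditional mean and variance of phi[i] given all other effects.

    Intrinsic CAR with binary adjacency:
        phi_i | phi_-i ~ Normal(mean of neighbouring phi_j, sigma2 / m_i)
    where m_i is the number of neighbours of region i.
    """
    nbrs = neighbors[i]
    if not nbrs:
        raise ValueError(f"region {i} has no neighbours")
    m = len(nbrs)
    cond_mean = sum(phi[j] for j in nbrs) / m
    return cond_mean, sigma2 / m

# Hypothetical map: four regions in a row, 0 - 1 - 2 - 3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
phi = [0.2, -0.1, 0.4, 0.0]

print(car_conditional(1, phi, neighbors, sigma2=1.0))
```

Region 1's own value does not enter its conditional mean, only the values of regions 0 and 2 do.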

To compare the various estimation approaches, we first considered simple correlations between the estimates and the rates obtained from the complete data (as considered by Tiwari et al [8]) and correlations between the age-standardized rates and the age-specific rates. The goal of these comparisons was not to demonstrate whether one approach is superior to another but rather to demonstrate the degree to which the approaches are similar to one another. In addition, we also compared the 2 Bayesian approaches by using the deviance information criterion (DIC) (28), which uses the posterior samples to produce a measure that is a compromise between model fit (denoted by D̄) and the effective number of parameters in the model. Additional details on DIC, including a discussion of its use with censored data, are provided in the Web Appendix.
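As a rough guide to how DIC is put together from posterior samples, the sketch below uses the standard definition: the posterior mean deviance plus an effective number of parameters, itself the mean deviance minus the deviance at the posterior mean. It assumes fully observed Poisson counts with a single shared rate; the censored-data version discussed in the Web Appendix is not reproduced, and every name and number here is hypothetical.

```python
import math
from statistics import mean

def poisson_deviance(counts, exposures, rate):
    """-2 * log-likelihood of counts[i] ~ Poisson(exposures[i] * rate)."""
    loglik = 0.0
    for y, n in zip(counts, exposures):
        mu = n * rate
        loglik += y * math.log(mu) - mu - math.lgamma(y + 1)
    return -2.0 * loglik

def dic(counts, exposures, rate_draws):
    """Return (DIC, p_D) from posterior draws of the rate.

    d_bar : posterior mean of the deviance (model fit)
    p_d   : d_bar minus the deviance at the posterior mean rate
    DIC   : d_bar + p_d
    """
    d_bar = mean([poisson_deviance(counts, exposures, r) for r in rate_draws])
    d_at_mean = poisson_deviance(counts, exposures, mean(rate_draws))
    p_d = d_bar - d_at_mean
    return d_bar + p_d, p_d

# Hypothetical data and posterior draws.
counts = [12, 30, 25]
exposures = [1_000, 2_500, 2_000]
draws = [0.0110, 0.0118, 0.0122, 0.0125, 0.0131]

print(dic(counts, exposures, draws))
```

Lower DIC is preferred, so a more complex model only wins if its improvement in fit outweighs its larger effective number of parameters.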

The maps of the age-standardized rates generated from the raw data (Figure 1A) and the maps generated by the Poisson-gamma model (Figure 1C) have strong similarities, while artifacts of substituting state-wide averages for suppressed counts based on the approach of Tiwari et al (8) lead to elevated estimates in many rural counties in the upper Midwest (Figure 1B). In contrast, the map of the estimates from the MCAR model (Figure 1D) preserves the overall trends in the data while producing significantly smoother rate estimates.

This analysis highlighted some of the benefits of using Bayesian methods to account for left-censored data like those encountered in CDC WONDER. Although the Poisson-gamma model is a relatively simple approach, models (such as the MCAR model) that explicitly account for multivariate spatial dependence structures can lead to better inference by leveraging other sources of information to produce more reliable estimates.

Finally, although we analyzed age-specific heart disease mortality as an illustration, the MCAR model is also well suited for analyzing rarer event data via its ability to jointly model multiple outcomes. This analysis leveraged information from older age groups with higher death counts to produce more reliable estimates for those aged 35 to 44. Similarly, one could jointly model a chronic disease outcome for multiple race/ethnicities, exploiting the shared factors that may lead to increased rates for non-Hispanic white persons and racial/ethnic minorities alike. Alternatively, one could use MCAR models to simultaneously analyze multiple chronic disease outcomes with similar etiologies to improve the reliability of all estimates.