Optimizing underestimated prediction with missing covariates in recent years

15 views

Skip to first unread message

Kai

unread,

Jul 18, 2024, 6:44:22 AM (3 days ago) Jul 18

to R-inla discussion group

Dear R-INLA community

I am doing spatial-temporal disease mapping for predicting count data. The prediction in more recent years (2023) looks not good (see graph below), compared to 2021. I guess this is because many of my covariates are 100% missing (not reported yet by the open data provider) in recent years (2022 & 2023), and the intercept (-11.851) made everything down while these missing covariates are regarded as 0 in INLA.

My main question is, does anyone think missing covariates imputation or joint model will be a good solution (Virgilio Gómez-Rubio et .al 2029. Missing data analysis and imputation via latent Gaussian Markov random fields) in this case?

About my model: The response variable is a count (suspicious polio cases, i.e. Non-Polio AFP) and the covariates are numerical and manually scaled already. The dataset has been aggregated by month and by district (admin 2 level) across three countries. I used negative binomial and bym2 for spatial and ar1 for year and grouped by month. I fit the all the years with the response variable between 2021 and 2023 given NA.

In the meantime, I am other issue that I am still struggling/reading / trying:

adjust my prior - three different priors applied to three different countries would be nice?
I need to do variable selection, either manually checking WAIC or a stepwise covariate selection (Xavier Vollenweider, et al). Or use prior for variable selection.
Cross validation. Would like to try leave group out (A. Adin et al. Automatic cross-validation in structured models: Is it time to leave out leave-one-out? 2024)
I know how to do scoring rules after prediction.
One stupid question. Why people always use ~1+ for the intercept in the formula?
The predicted graph looks a u-curve within a year. I may need to try other methods, e.g. rw1, or only use indexed-months a time variable?
Is res$summary.fitted.values[,"mean"] * df_all_gis_scale$population a right way to extract my count prediction given my negative binomial distribution?

Apologies for this long post. I will appreciate if you spot any issues here. Any recommendations are welcome! Thanks for developing this such beautiful INLA method!

Wish you have a great day!!!

Kai

###################CODE##################

#model

pc_prior <- list(prec = list(prior="pc.prec",

param = c(3* sd(df_all_gis_scale$no_npafp),0.01)))

temp_formula_prediction <- as.formula(

no_npafp_predicted ~ 1+conflict_total_events + sia + rain_r1h_avg +

interval_incidence_province + stool_adequacy + google_trend +

vegetation_anomaly + c_influenza + c_report_measles + c_financing_gghegge +

c_wash_hygiene + c_infant_mortality + c_development_gdp +

c_vac_coverage_polio + c_temperature +

f(index_district, model = "bym2", graph = graph_neighbour_list, scale.model = TRUE) +

f(index_year, model = "iid", group = month, control.group = list(model = "ar1", scale.model = TRUE), hyper = pc_prior))#index year ranges from1:6; month ranges from 1 to 12. (non-cyclic)

#fit

res <- inla(data = df_all_gis_scale, E = population, family = "nbinomial",

control.predictor = list(compute = TRUE, link=1), control.compute = list(dic=TRUE,waic=TRUE,return.marginals.predictor=TRUE,cpo=TRUE),

formula = temp_formula_prediction ) #response variable between 2021 and 2023 has been replaced by NA

#the fit summary is attached.

The predicted count data = res$summary.fitted.values[,"mean"] * df_all_gis_scale$population # as I saw somewhere that you need to transform it back by multiplying population when you did a offset, right???

graph_prediction.png

model_summary.txt

Helpdesk (Haavard Rue)

unread,

Jul 20, 2024, 12:19:30 PM (yesterday) Jul 20

to Kai, R-inla discussion group

On Thu, 2024-07-18 at 02:52 -0700, Kai wrote:
> Dear R-INLA community
> I am doing spatial-temporal disease mapping for predicting count data. The
> prediction in more recent years (2023) looks not good (see graph below),
> compared to 2021. I guess this is because many of my covariates are 100%
> missing (not reported yet by the open data provider) in recent years (2022 &
> 2023), and the intercept (-11.851) made everything down while these missing
> covariates are regarded as 0 in INLA.

> graph_prediction.png
> My main question is, does anyone think missing covariates imputation or joint
> model will be a good solution (Virgilio Gómez-Rubio et .al 2029. Missing data
> analysis and imputation via latent Gaussian Markov random fields) in this
> case?

if you have missing covariates, like x[10] is NA, then this is automatically
translated to 0, meaning no contribution to the linear predictor from x at index
10.

if you have missing covariates, and to such an extent that is makes a
difference, then you can either

- do nothing, then 0's are used. this is likely not a good idea

- build a model for the covariates in the model, and try to use that one to
predict or impute covariates values, and plug these into the main model. you can
do several 'realisations' and merge the results (using inla.merge() )

- for continuous covariates only, you can build a joint model for the covariates
and data and estimate them (ie the `imputations` jointly). this is somewhat more
involved.

> In the meantime, I am other issue that I am still struggling/reading / trying:

> 1. adjust my prior - three different priors applied to three different
> countries would be nice?

you can specify this in the model

> 2. I need to do variable selection, either manually checking WAIC or a

> stepwise covariate selection (Xavier Vollenweider, et al). Or use prior for
> variable selection.

for this you need to run several runs, each for each covariate selection and
manually filter the results using WAIC or marginal likelihood

> 3. Cross validation. Would like to try leave group out (A. Adin et al.

> Automatic cross-validation in structured models: Is it time to leave out
> leave-one-out? 2024)

yes, there is a vignette about that

> 5. One stupid question. Why people always use ~1+ for the intercept in the
> formula?

adding -1 in the formula, prevent adding the automatic intercept to the model.
sometimes, a common intercept make not sense, and its better to use f.ex

~ -1 + intercept + ...

where 'intercept' is a factor, giving one intercept pr factor level.

> 6. The predicted graph looks a u-curve within a year. I may need to try

> other methods, e.g. rw1, or only use indexed-months a time variable?

you can do that.

> 7. Is res$summary.fitted.values[,"mean"] * df_all_gis_scale$population a

> right way to extract my count prediction given my negative binomial
> distribution?

seems so, yes

> --
> You received this message because you are subscribed to the Google Groups "R-
> inla discussion group" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to r-inla-discussion...@googlegroups.com.
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/r-inla-discussion-group/4d2e3c31-8992-4944-9961-677b665fc539n%40googlegroups.com
> .

--
Håvard Rue
he...@r-inla.org

Reply all

Reply to author

Forward

0 new messages