Dear INLA group,
I have a question regarding the interpretation of three validation metrics obtained using the inla.group.cv function. My goal is to compare a calibration model (a spatio-temporal regression model) against a data fusion model (Bayesian melding, i.e. a joint model with two likelihoods).
For each model, I computed the following monthly metrics:
- Negative logarithmic score (LS)
- Dawid-Sebastiani score (DS)
- Kullback-Leibler divergence (KLD)
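For reference, here is a minimal sketch of how I understand these three quantities for Gaussian predictive distributions. The Gaussian assumption, the function names, and the use of per-observation predictive means/SDs are mine, not taken from inla.group.cv itself; I include it only so it is clear which definitions my numbers are based on. Note that LS and DS score a predictive density against the held-out observation, while the KLD compares two densities to each other, which may be relevant to my question below.

```python
import math

def log_score(y, mu, sd):
    """Negative log predictive density of y under N(mu, sd^2)."""
    return 0.5 * math.log(2 * math.pi * sd**2) + (y - mu)**2 / (2 * sd**2)

def dawid_sebastiani(y, mu, sd):
    """Dawid-Sebastiani score: standardized squared error plus log variance."""
    return ((y - mu) / sd)**2 + 2 * math.log(sd)

def kld_gaussians(mu1, sd1, mu2, sd2):
    """KL divergence KL( N(mu1, sd1^2) || N(mu2, sd2^2) ) in closed form."""
    return (math.log(sd2 / sd1)
            + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2)
            - 0.5)
```

The monthly values I report are averages of these per-observation quantities.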
The results are provided in the attached file. My question concerns an apparent inconsistency:
The KLD suggests that the data fusion model performs better across all months.
However, the LS and DS metrics indicate either similar performance between the two models or slightly better performance of the calibration model in certain months.
I would have expected these metrics to show more coherent behavior. Additionally, other metrics I computed (RMSE and MAE, not shown here) align more closely with LS and DS, suggesting comparable or slightly better performance for the calibration model.
How can I reconcile the KLD results with the other metrics? Is there an interpretation or methodological consideration I might be missing?
I would greatly appreciate any insights or suggestions you might have.
Thanks for your help,
Guido