Hi Isaac,
That is a very interesting question you asked! I should start by saying that the confidence interval is used for computing some of the performance measures (MAUC, WES, CPA). That is, the confidence interval should be constructed such that 50% of the true values in the (D4) test set lie in that region. Whether you estimate it from cross-validation folds on the training set or from a separate validation set is up to you, and is an assumption your method would make. Note that you do not have to assume Gaussianity to compute it (e.g. Ventricle volumes are always positive, so a Gaussian noise model is not ideal there because it assigns probability to negative values).
How exactly you "estimate" the confidence interval is a matter of statistical science. What you mentioned are two potentially good estimators of a "true" unobserved confidence interval. Some estimators have higher variance and lower bias, others have lower variance and higher bias, and it is up to you to choose the right one based on your model or just intuition. You can also have more complex models for estimating the confidence interval: for example, you could have an a priori belief that the D4 test set has a distribution shift compared to D1-D2. Maybe scans in ADNI3 are higher-resolution, and that would give you more precise Ventricle Volume measurements, so you can narrow the confidence interval. You can encode this into your model if you want, or whatever else you think might be important.
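To make the validation-set option concrete, here is a minimal Python sketch of one possible estimator. It takes the empirical 25% and 75% quantiles of held-out residuals (true minus predicted), so it does not assume Gaussian noise. The function names, the `floor` parameter and the simulated residuals are all made up for illustration; this is not part of the challenge code or the official evaluation.

```python
import numpy as np


def empirical_interval(residuals, prediction, coverage=0.5, floor=None):
    """Interval around `prediction` built from held-out residuals.

    `residuals` are (true - predicted) values from a validation set or
    from cross-validation folds. `floor` is an optional lower bound,
    e.g. 0 for quantities that cannot be negative.
    """
    residuals = np.asarray(residuals, dtype=float)
    tail = (1.0 - coverage) / 2.0
    # Empirical quantiles of the residuals: no Gaussian assumption.
    low_q, high_q = np.quantile(residuals, [tail, 1.0 - tail])
    lower, upper = prediction + low_q, prediction + high_q
    if floor is not None:
        lower = max(lower, floor)
    return lower, upper


def coverage_fraction(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


# Simulated, skewed residuals (purely illustrative, not real data).
rng = np.random.default_rng(0)
val_residuals = rng.lognormal(mean=0.0, sigma=0.5, size=2000) - 1.0

# Hypothetical point forecast, in arbitrary units.
prediction = 10.0
lower, upper = empirical_interval(val_residuals, prediction, floor=0.0)

# Sanity check on the same residuals: should be close to 0.5.
cov = coverage_fraction(prediction + val_residuals, lower, upper)
print(lower, upper, cov)
```

The coverage check here reuses the residuals the interval was built from, so it only confirms the arithmetic. Whether the interval still covers about 50% of D4 depends on the held-out errors being representative of the test set, which is exactly the assumption discussed above.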
I hope this helps.
Raz