5 imputed datasets in training dataset and test dataset (SAS)


Dandi

Apr 1, 2020, 5:15:11 AM
to Missing Data
Dear all,

Is there any option or statement I can use to split my 5 imputed datasets into a training set and a test set? I know how to do it when I have only one dataset, but I am not sure how to handle the 5 imputed datasets.

Thanks!

Dandi

Jonathan Bartlett

Apr 1, 2020, 6:06:19 AM
to missin...@googlegroups.com
Hi Dandi

Can you not just create a binary variable called training (0/1) before you do the imputation, indicating whether an observation will be in the training or the test data? This variable will then be present in each of the imputed datasets, and you could then split each imputed dataset into training and test. Another question is whether, if you want to use a training/test approach for prediction, you should impute the whole dataset first or instead impute the training and test data separately. I don't know what the right answer to this question is; I am just raising it.

Best wishes
Jonathan
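
The indicator approach described above can be sketched as follows. The thread is about SAS, but the logic is shown here in plain Python as an illustration only. All names (`add_training_flag`, `fake_impute`, the `id` and `imputation` fields, the 70/30 fraction, the seed) are assumptions made for this sketch, and `fake_impute` is only a stand-in that stacks copies of the data; it is not a real imputation method.

```python
import random

def add_training_flag(patients, train_fraction=0.7, seed=2020):
    """Return copies of the patient records with a 0/1 `training` flag."""
    rng = random.Random(seed)
    ids = sorted(p["id"] for p in patients)
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    train_ids = set(ids[:n_train])
    return [{**p, "training": int(p["id"] in train_ids)} for p in patients]

def fake_impute(patients, m=5):
    """Stand-in for multiple imputation: stack m copies of the data.

    Missing values of d get a placeholder, NOT a real imputed value.
    """
    stacked = []
    for k in range(1, m + 1):
        for p in patients:
            row = dict(p, imputation=k)
            if row["d"] is None:
                row["d"] = 0.0
            stacked.append(row)
    return stacked

# Toy data: 20 patients, every fourth one has d missing.
patients = [{"id": i, "d": None if i % 4 == 0 else float(i)} for i in range(1, 21)]

flagged = add_training_flag(patients)   # flag is created BEFORE imputation
stacked = fake_impute(flagged)          # flag is carried into every imputed dataset

train = [r for r in stacked if r["training"] == 1]
test = [r for r in stacked if r["training"] == 0]

# The same patients form the training set in each of the 5 imputed datasets.
train_ids_by_imp = {
    k: {r["id"] for r in train if r["imputation"] == k} for k in range(1, 6)
}
```

Because the flag is attached to the patient before imputation, every imputed dataset inherits the same split, and no extra bookkeeping is needed afterwards.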

--
You received this message because you are subscribed to the Google Groups "Missing Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to missing-data...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/missing-data/3b5f8138-2b3a-4036-9dcb-9b8bc21fe81e%40googlegroups.com.

Dandi

Apr 1, 2020, 7:38:16 AM
to Missing Data
Hi Jonathan,

Thank you for your reply. My case is a bit different. I have seen many people split the data into a training set and a test set before doing multiple imputation, but I did it the other way round: I first ran multiple imputation with a different set of variables and a larger sample, and then took the variables and patients I need for this model. For example, I used a, b, c, d, e in FCS multiple imputation with 500 patients to get 5 large imputed datasets. I did this because the dataset is being prepared for several projects in our department, and with many important variables included (I consulted clinicians on variable selection), the results should be more reliable.

I then take variables d and e into my prediction model. So my prediction model will have d, e, g, h, z with 250 patients and 5 imputed datasets (the 250 patients are a subset of the 500 patients above, chosen by specific patient selection criteria). From this point on, I am working on the prediction model.

Now I need to split them into a training set and a test set. I am not sure whether I should draw a random sample in one dataset and keep the same patients in the other 4, or draw a random sample 5 times, once per imputed dataset.

Best,

Dandi
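
The two options in the question above can be compared with a small sketch. It is plain Python rather than SAS, and it only illustrates the bookkeeping. The function names, the 70/30 fraction, the seed and the patient IDs 1 to 250 are assumptions made for this example. It does not settle which option is statistically preferable; it only shows what each one does to patient membership.

```python
import random

def split_once(ids, train_fraction, rng):
    """Draw one random training set (a set of patient IDs)."""
    shuffled = sorted(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return set(shuffled[:n_train])

def mixed_role_patients(train_sets):
    """Patients who are training in some imputed datasets and test in others."""
    ever_train = set().union(*train_sets.values())
    always_train = set.intersection(*train_sets.values())
    return ever_train - always_train

patient_ids = range(1, 251)   # the 250 selected patients (hypothetical IDs)
m = 5                         # number of imputed datasets

# Option A: sample once, reuse the same patients in all 5 imputed datasets.
rng = random.Random(1)
shared = split_once(patient_ids, 0.7, rng)
option_a = {k: shared for k in range(1, m + 1)}

# Option B: sample independently within each imputed dataset.
rng = random.Random(1)
option_b = {k: split_once(patient_ids, 0.7, rng) for k in range(1, m + 1)}

mixed_a = mixed_role_patients(option_a)
mixed_b = mixed_role_patients(option_b)
print(len(mixed_a), "patients change role under option A")
print(len(mixed_b), "patients change role under option B")
```

Under option A every patient has the same role in all 5 imputed datasets, which matches what a single training indicator would give. Under option B a patient can be in the training set of one imputed dataset and in the test set of another, so if results are later combined across the imputed datasets, the test patients are no longer strictly unseen.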
