Parallelization for validation and test set

Charlotte Maschke

Aug 14, 2023, 9:24:33 AM
to physionet-challenges
Dear challenge organizers, 

I am using multiprocess in the train_challenge_model function of team_code.py to parallelize each subject's feature extraction on a separate CPU.

The feature extraction in our code is computationally heavy. In the cross-validation you calculate all features individually over 12, 24, 48 and 72 hours, which again multiplies the time needed for feature extraction.
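
Roughly, my training-side parallelization looks like this (a simplified sketch: the feature computation is a stub, and the one-folder-per-patient layout is my assumption, not the organizers' API):

```python
import os
from multiprocessing import Pool

def extract_features(args):
    """Stand-in for our real (expensive) per-patient feature extraction."""
    data_folder, patient_id = args
    # The real code loads this patient's recordings and computes features here.
    return patient_id, [0.0]

def train_challenge_model(data_folder, model_folder, verbose):
    patient_ids = sorted(os.listdir(data_folder))  # one subfolder per patient
    with Pool() as pool:  # one worker per available CPU by default
        features = dict(pool.map(extract_features,
                                 [(data_folder, p) for p in patient_ids]))
    # ... fit the model on `features` and save it to model_folder ...
```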

I was planning to parallelize 'run_challenge_models' in the same way as the 'train_challenge_model' function. However, run_model.py (which we are not supposed to edit) already contains the loop over subjects, and run_challenge_models (the function I can edit) takes only one subject as input. The parallelization would therefore need to happen in run_model.py.

The only way to parallelize over subjects in this step would be to change the run_model() function so that one CPU is assigned to the feature calculation and prediction for one subject.
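
Concretely, I imagine something like the following change in run_model.py (purely illustrative, since we are not supposed to edit this file; the function names and signature here are placeholders for your actual code):

```python
from multiprocessing import Pool

def _predict_one_patient(args):
    models, data_folder, output_folder, patient_id = args
    # Placeholder for: load this patient's data, call run_challenge_models,
    # and write the outputs to output_folder.
    return patient_id

def run_model(models, data_folder, output_folder, patient_ids, n_jobs=1):
    # Replace the sequential loop over patients with a pool of n_jobs workers.
    with Pool(n_jobs) as pool:
        pool.map(_predict_one_patient,
                 [(models, data_folder, output_folder, p) for p in patient_ids])
```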

Is there a way to do this? Could you perhaps just add an n_jobs argument to this function?

Thank you for your answer.
Best, 
Charlotte


PhysioNet Challenge

Aug 14, 2023, 9:40:17 AM
to physionet-challenges
Dear Charlotte,

Thank you for your question. Yes, run_model.py does process one patient at a time, and this is by design: allowing this parallelization would make it easy to unintentionally learn from the test data. We can’t make an exception in this case. However, if the goal is to save time by parallelization, you might want to parallelize feature extraction on the patient level instead.
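
For example, you could split the work inside run_challenge_models across the many recordings of the single patient it receives. A rough sketch (the recording-level split, the .hea file layout, and the feature stub are assumptions for illustration, not part of the official code):

```python
import glob
import os
from multiprocessing import Pool

def _features_for_recording(record_path):
    """Stand-in for an expensive per-recording feature computation."""
    return [0.0]

def run_challenge_models(models, data_folder, patient_id, verbose):
    # One patient, many hourly recordings: parallelize across the recordings.
    record_paths = sorted(glob.glob(os.path.join(data_folder, patient_id, '*.hea')))
    with Pool() as pool:
        per_recording = pool.map(_features_for_recording, record_paths)
    # ... aggregate per_recording into patient-level features and predict ...
```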

Best,
James

Charlotte Maschke

Aug 14, 2023, 3:22:33 PM
to physionet-challenges

Dear James, 
thank you for your reply. I understand the design choice in this case, and I already parallelize the feature extraction. The website says you impose a 24-hour time limit on running trained models on the validation data. However, you repeat this process 4 times, for the 12-, 24-, 48- and 72-hour data. Are the 24 hours renewed each time you start a new validation run? Otherwise the total time you allow for one prediction is less than 2 minutes (24 h / (4 × 200) ≈ 1.8 min, assuming 200 validation subjects), which is very little for preprocessing, feature extraction and model prediction.

Thank you for your clarification on this. 

Best, 
Charlotte

PhysioNet Challenge

Aug 14, 2023, 4:53:06 PM
to physionet-challenges
Dear Charlotte,

I am updating my previous reply to post this to the whole group.
  • Yes, the 24 hours are renewed. Your model gets 24 hours for each timepoint: 24 hours for the 12-hour validation data, 24 hours for the 24-hour validation data, and so on.
  • Additionally, since the test data is 3 times larger, your model gets 72 hours instead of 24 hours for the test data.
Best,
James

Charlotte Maschke

Aug 15, 2023, 10:55:20 AM
to physionet-challenges

Dear James, 
this makes a considerable difference for the test phase! Your answer helps a lot.

Thank you. 
Charlotte

Charlotte Maschke

Aug 15, 2023, 11:25:22 AM
to physionet-challenges
Dear James, 
Thank you for this information!
I have one follow-up question:
Would it be possible to provide us with the number of subjects in the validation and test set? 

Best,
Charlotte

PhysioNet Challenge

Aug 15, 2023, 12:33:25 PM
to physionet-challenges
Dear Charlotte,

Sure! There are 107 patients (approximately 10%) in the validation set and 306 patients (approximately 30%) in the test set. Please make sure you don’t rely on these exact numbers in your code.

Best,
James