Samitrop Labels


Martin Sondermann

Mar 14, 2025, 10:57:50 PM
to physionet-challenges

Hello Challenge Organizers,

We've noticed something interesting about the datasets provided for the Challenge and would appreciate some clarification.

According to the Challenge description, the SaMi-Trop dataset contains 1,631 12-lead ECG records, but only half of these records are included in the training set. This differs from the other datasets (CODE-15% and PTB-XL), which appear to be provided in full for training.

Specifically:

  1. The samitrop_chagas_labels.csv file contains approximately 815 records (half of the 1,631 mentioned)
  2. The documentation states: "These data are publicly available, and half are included in the Challenge training set."
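For anyone who wants to reproduce the count, here is a minimal sketch using only the Python standard library. It assumes the labels file has a single header row followed by one row per record; the helper names are hypothetical and not part of the Challenge code.

```python
import csv

# Record count quoted in the Challenge description for SaMi-Trop.
EXPECTED_TOTAL = 1631


def count_label_rows(path):
    """Count data rows in a labels CSV, assuming the first row is a header."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)  # ignore blank lines


def fraction_of_total(n_rows, total=EXPECTED_TOTAL):
    """Return the share of the full dataset that n_rows represents."""
    return n_rows / total
```

Calling `count_label_rows("samitrop_chagas_labels.csv")` and passing the result to `fraction_of_total` should give a value close to 0.5 if roughly half of the records are present.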

I'm curious about:

  • Is there a specific methodological reason for including only half of the SaMi-Trop dataset in training, while providing the other datasets in full?
  • Are there particular characteristics of the withheld SaMi-Trop records that might be important to consider for model development?
  • Since the remaining records are available through the public dataset, is it still possible to include them in training, given that their Chagas-positive labels are in principle known?

Thank you for your guidance on this matter.

Best regards


PhysioNet Challenge

Mar 14, 2025, 11:02:19 PM
to physionet-challenges
Dear Martin,

Good observation. For the unofficial phase, we are including half of the SaMi-Trop dataset in the training set and the other half in the validation set. Both halves have approximately the same demographics and Chagas prevalence. We split the SaMi-Trop dataset between the training and validation sets for the unofficial phase because we were still preparing the other datasets.
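As an illustration only (this is not necessarily the procedure used for the Challenge), a split into two halves with matching subgroup proportions could be sketched as below. The function name, the record format, and the field names `sex` and `chagas` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_half_split(records, key, seed=0):
    """Split records into two halves, balancing the strata given by key(record).

    Within every stratum the two halves differ by at most one record,
    and the overall half sizes also differ by at most one.
    """
    rng = random.Random(seed)

    # Group the records by stratum (e.g., sex and label).
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    first, second = [], []
    # Visit strata in a deterministic order so the split is reproducible.
    for _, group in sorted(strata.items(), key=lambda kv: repr(kv[0])):
        rng.shuffle(group)
        for rec in group:
            # Always add to the currently shorter half (ties go to `first`),
            # which alternates assignments within each stratum.
            if len(first) <= len(second):
                first.append(rec)
            else:
                second.append(rec)
    return first, second
```

For example, `stratified_half_split(records, key=lambda r: (r["sex"], r["chagas"]))` returns two non-overlapping lists in which each (sex, label) subgroup is divided as evenly as possible.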

For the official phase, we will include all of the SaMi-Trop dataset in the training set and include a new, unreleased dataset in the validation set, so please expect some changes for the official phase, and please continue to share suggestions that can help us improve the Challenge ahead of the official phase.

This shouldn't create any problems for the unofficial phase because (1) we are training models on the training set and evaluating them on the validation set, which do not and will not overlap, and (2) this is still the unofficial phase, when we are all "kicking the tires" of the Challenge. Observations and suggestions are especially welcome during the unofficial phase. Of course, if someone's method uses the data source as a feature, then it may improve its performance on the validation set during the unofficial phase, but it will generally not improve the generalizability of the approach on unseen data...

Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.