Dear Martin,
Good observation. For the unofficial phase, we have split the SaMi-Trop dataset in half: one half is in the training set and the other half is in the validation set. The two halves have approximately the same demographics and Chagas prevalence. We split the dataset this way for the unofficial phase because we were still preparing the other datasets.
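As an illustration only (this is not necessarily how the Challenge split was produced), a half-and-half split that keeps label prevalence and demographics roughly balanced can be sketched as a stratified split. The record layout, the `stratified_half_split` helper, and the toy data below are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_half_split(records, key, seed=0):
    """Split records into two halves with similar proportions of each stratum.

    `key` maps a record to its stratum, e.g. (label, sex).
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    first, second = [], []
    # Iterate over strata in sorted order so the result is reproducible.
    for stratum in sorted(strata):
        members = strata[stratum]
        rng.shuffle(members)
        mid = len(members) // 2
        first.extend(members[:mid])
        second.extend(members[mid:])
    return first, second


# Hypothetical toy records: (record_id, chagas_label, sex).
records = [(i, i % 5 == 0, "F" if i % 2 else "M") for i in range(1000)]

# Stratify on both the label and one demographic variable.
train, val = stratified_half_split(records, key=lambda r: (r[1], r[2]))

# The two halves share no record IDs.
assert not {r[0] for r in train} & {r[0] for r in val}
```

Because each stratum is halved separately, the two halves end up with nearly the same label prevalence and demographic mix, differing by at most one record per stratum.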
For the official phase, we will include all of the SaMi-Trop dataset in the training set and a new, unreleased dataset in the validation set. Please expect some changes for the official phase, and please continue to share suggestions that can help us improve the Challenge before then.
This shouldn't create any problems for the unofficial phase because (1) we are training models on the training set and evaluating them on the validation set, which do not and will not overlap, and (2) this is still the unofficial phase, when we are all "kicking the tires" of the Challenge. Observations and suggestions are especially welcome during the unofficial phase. Of course, if someone's method uses the data source as a feature, then it may perform better on the validation set during the unofficial phase, but that will generally not improve how well the approach generalizes to unseen data.
Best,
Matt
(On behalf of the Challenge team.)
Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.