Dear Arnaud,
Thank you for the kind words.
The prevalence of Chagas disease in the training, validation, and test sets approximately matches the prevalence of Chagas disease in Brazil (approximately 3%), which is the source of the Chagas data for the Challenge.
The training set contains a mixture of large datasets with weak labels and smaller datasets with strong labels, including
the CODE-15% dataset, which contains over 300,000 12-lead ECGs with self-reported Chagas labels;
the SaMi-Trop dataset, which contains 1,631 12-lead ECGs with serologically validated positive Chagas labels; and
the PTB-XL dataset, which contains 21,779 12-lead ECGs with (most likely) negative Chagas labels from a non-endemic area of the world.
The validation and test sets will include strongly labeled datasets from sources that are not represented in the training set.
The combination of strong and weak labels reflects real-world problems and datasets, where data and labels are noisy, and approaches that recognize and remediate that noise tend to perform better. Patients may or may not report their Chagas status correctly for a number of reasons, so the labels for the CODE-15% dataset may not reflect their actual Chagas status, but they should still provide information. How you use this information, and how you combine the large number of weak labels with the smaller number of strong labels, is up to you.
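As one illustrative possibility, and not a recommendation from the Challenge team, weak and strong labels could be combined by giving each recording a weight that reflects how much you trust its label source and using those weights in the training loss. The sketch below is a minimal example under that assumption: the weight values and the names SOURCE_WEIGHTS, sample_weights, and weighted_log_loss are hypothetical and would need tuning and validation on your own data.

```python
import math

# Hypothetical trust weights per label source. These values are
# illustrative assumptions only, not Challenge recommendations.
SOURCE_WEIGHTS = {
    "CODE-15%": 0.5,   # weak (self-reported) labels
    "SaMi-Trop": 1.0,  # strong (serologically validated) positive labels
    "PTB-XL": 1.0,     # most likely negative labels from a non-endemic area
}


def sample_weights(sources, default=1.0):
    """Map each recording's source name to its trust weight."""
    return [SOURCE_WEIGHTS.get(source, default) for source in sources]


def weighted_log_loss(labels, probabilities, weights, eps=1e-12):
    """Weighted binary cross-entropy, normalized by the total weight."""
    total = 0.0
    for y, p, w in zip(labels, probabilities, weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


if __name__ == "__main__":
    sources = ["CODE-15%", "SaMi-Trop", "PTB-XL"]
    labels = [1, 1, 0]
    probabilities = [0.3, 0.8, 0.1]  # made-up model outputs
    weights = sample_weights(sources)
    print(weighted_log_loss(labels, probabilities, weights))
```

With weights like these, a wrong prediction on a self-reported label costs less than the same mistake on a serologically validated label. Many training libraries accept per-sample weights, so the same idea can be applied without writing a custom loss.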
Best,
Matt
(On behalf of the Challenge team.)