Prevalence

Arnaud Champetier

Mar 9, 2025, 10:28:49 PM
to physionet-challenges
Dear Physionet team, 

I would first like to thank you for the great organization. 

My question concerns the prevalence of Chagas disease in the test set. CODE-15% was not designed for Chagas screening; if I understand correctly, SaMi-Trop was, hence the difference in prevalence between the two datasets. The true positive rate is prevalence dependent, so to train the model properly we should know the type of population in which it is intended to be used. If the model is meant for screening people likely to have Chagas disease, then the prevalence in the dataset on which we train the final version of our model should be close to the prevalence in that population. 

In other words, is the prevalence in the test dataset closer to that of SaMi-Trop or to that of CODE-15%? 
I believe that this has a large impact on the metric, and knowing it would save all the teams some experimentation time. 

Thank you for your time and devotion, 

Kind regards, 

Arnaud

PhysioNet Challenge

Mar 9, 2025, 10:30:29 PM
to physionet-challenges

Dear Arnaud,


Thank you for the kind words.


The prevalence of Chagas disease in the training, validation, and test sets approximately matches the prevalence of Chagas disease in Brazil, which is the source of the Chagas data for the Challenge. The prevalence of Chagas disease in Brazil is approximately 3%.


The training set contains a mixture of large datasets with weak labels and smaller datasets with strong labels, including

  • the CODE-15% dataset, which contains over 300,000 12-lead ECGs with self-reported Chagas labels;

  • the SaMi-Trop dataset, which contains 1,631 12-lead ECGs with serologically validated positive Chagas labels; and

  • the PTB-XL dataset, which contains 21,779 12-lead ECGs with (most likely) negative Chagas labels from a non-endemic area of the world.

The validation and test sets will include datasets with strong labels with data from sources that are not represented in the training set.
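
As an illustration only (not an official Challenge recommendation), the approximately 3% prevalence mentioned above could be approximated in a local training or hold-out split by subsampling one class. The sketch below assumes binary labels (1 = Chagas positive, 0 = negative); the function name `subsample_to_prevalence` and its parameters are hypothetical and are not part of the Challenge example code.

```python
import random


def subsample_to_prevalence(labels, target_prevalence, seed=0):
    """Return sorted record indices whose positive fraction is close to target_prevalence.

    Hypothetical helper: keeps every record of the limiting class and
    randomly subsamples the other class without replacement.
    """
    if not 0.0 < target_prevalence < 1.0:
        raise ValueError("target_prevalence must be strictly between 0 and 1")

    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]

    # Odds of a positive at the target prevalence, e.g. 0.03 / 0.97.
    odds = target_prevalence / (1.0 - target_prevalence)

    # Negatives required if every positive is kept.
    neg_needed = round(len(pos_idx) / odds)

    if neg_needed <= len(neg_idx):
        # Enough negatives: keep all positives, subsample negatives.
        keep = pos_idx + rng.sample(neg_idx, neg_needed)
    else:
        # Too few negatives: keep all negatives, subsample positives.
        pos_needed = min(len(pos_idx), round(len(neg_idx) * odds))
        keep = rng.sample(pos_idx, pos_needed) + neg_idx

    return sorted(keep)


if __name__ == "__main__":
    # Toy labels only: 50 positives and 950 negatives (5% prevalence).
    toy_labels = [1] * 50 + [0] * 950
    kept = subsample_to_prevalence(toy_labels, target_prevalence=0.03)
    prevalence = sum(toy_labels[i] for i in kept) / len(kept)
    print(f"kept {len(kept)} records, prevalence {prevalence:.3f}")
```

Subsampling discards data, so reweighting the classes during training is a common alternative; which works better here is an empirical question.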


The combination of strong and weak labels reflects real-world problems and datasets, where data and labels are noisy, and approaches that recognize and remediate noise tend to perform better. Patients may or may not correctly report their Chagas status for a number of reasons, so the labels for the CODE-15% dataset may or may not reflect their actual Chagas status, but they should still provide information. How you use this information, and how you combine the large number of weak labels with the smaller number of strong labels, is up to you.
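
One possible way to combine weak and strong labels, shown purely as a sketch, is to down-weight records from weakly labeled sources in the training loss. The weights in `SOURCE_WEIGHTS` and the function `weighted_log_loss` below are hypothetical placeholders chosen for illustration; they are not values or code provided by the Challenge team, and suitable weights would need to be found by experiment.

```python
import math

# Hypothetical per-source weights: self-reported (weak) labels count less
# than serologically validated or otherwise more reliable labels.
SOURCE_WEIGHTS = {
    "CODE-15%": 0.3,   # weak, self-reported labels (arbitrary weight)
    "SaMi-Trop": 1.0,  # strong positive labels
    "PTB-XL": 1.0,     # (most likely) negative labels
}


def weighted_log_loss(labels, probabilities, sources,
                      source_weights=SOURCE_WEIGHTS, eps=1e-12):
    """Binary cross-entropy in which each record is weighted by its source."""
    total = 0.0
    weight_sum = 0.0
    for y, p, source in zip(labels, probabilities, sources):
        w = source_weights[source]
        # Clip probabilities to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum


if __name__ == "__main__":
    # Toy example: the same confident mistake costs less on a weak label.
    print(weighted_log_loss([1, 1], [0.9, 0.1], ["SaMi-Trop", "CODE-15%"]))
    print(weighted_log_loss([1, 1], [0.9, 0.1], ["SaMi-Trop", "SaMi-Trop"]))
```

Most training libraries accept per-sample weights directly, so in practice the same idea can usually be applied without writing a custom loss.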


Best,

Matt

(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.