Dear Challengers,
Over the past few weeks, we have received many posts and emails about the Challenge data and the differences between the Challenge data sources. Some of these posts and emails observe, correctly, that identifying the data sources is easier than identifying the labels, and they express concern that some teams may be intentionally or unintentionally identifying the data sources instead of, or as a proxy for, the labels. We understand and appreciate these concerns, and we want to warn teams about them more directly.
We emphasize that the validation and test data come from different sources than the training data. The validation and test data are not publicly available, and they have different characteristics from the training data. Approaches that learn characteristics specific to the training set, and, more specifically, the differences between the data sources in the training set, are unlikely to perform well on the validation and test sets. These choices are deliberate because real-world algorithms are inevitably applied to data that were not used for training.
During the unofficial phase, we observed that many algorithms made the mistake of training on features of the databases that are noncausally correlated with the target class. These features include the names and sampling frequencies of the databases, among other differences. Any algorithm that relies on the name or sampling frequency of a database to classify a record will perform poorly on records with new or missing names or with different sampling frequencies, even when the underlying signals are substantially the same. These features also include subtler properties of the signals, such as the power around 50 Hz and 60 Hz, among other differences.
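To make this concrete, here is a minimal, hypothetical sketch (the names signal and fs are placeholders, and none of this is official Challenge code) of how a simple summary of the power near 50 Hz and 60 Hz can be computed for a record. If such a summary separates the training databases cleanly, a model can use it to identify the source rather than the label.

    # A minimal sketch (not official Challenge code): estimate the fraction of
    # spectral power near the 50 Hz and 60 Hz powerline frequencies per record.
    # Assumes `signal` is a 1-D NumPy array and `fs` is its sampling frequency in Hz.
    import numpy as np
    from scipy.signal import welch

    def powerline_band_fractions(signal, fs, bands=((48, 52), (58, 62))):
        nperseg = min(len(signal), 4 * int(fs))
        freqs, psd = welch(signal, fs=fs, nperseg=nperseg)
        fractions = []
        for lo, hi in bands:
            mask = (freqs >= lo) & (freqs <= hi)
            fractions.append(psd[mask].sum() / psd.sum())
        return fractions  # [fraction near 50 Hz, fraction near 60 Hz]

A summary like this that differs sharply by database is a warning sign that spectral features may act as a database fingerprint rather than as information about the target class.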
Some of the posts and emails suggested that we resample the data when creating the training set to address these issues for the teams. However, we deliberately decided not to process the public data beyond unifying the different "raw" data formats. Resampling alone would not be enough to disguise the inter-database differences, and we want to see how the teams approach this problem. It is up to each team to decide on its preprocessing and training strategies to ensure that its algorithms do not learn noncausally related features.
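As one possible strategy, offered only as an illustration and not as a prescription, a team might resample every record to a common frequency and notch out the 50 Hz and 60 Hz bands before extracting features. The target frequency of 250 Hz below is an arbitrary example, and the sketch assumes integer sampling frequencies and 1-D signals.

    # A minimal sketch of one possible preprocessing step (not a requirement):
    # resample to a shared frequency and suppress 50/60 Hz powerline interference.
    # Assumes `signal` is a 1-D NumPy array and `fs` is an integer sampling frequency.
    from fractions import Fraction
    from scipy.signal import resample_poly, iirnotch, filtfilt

    def harmonize(signal, fs, target_fs=250, powerline_freqs=(50.0, 60.0), q=30.0):
        ratio = Fraction(int(target_fs), int(fs))  # reduced up/down resampling ratio
        resampled = resample_poly(signal, ratio.numerator, ratio.denominator)
        for f0 in powerline_freqs:
            if f0 < target_fs / 2:  # only notch frequencies below Nyquist
                b, a = iirnotch(f0, Q=q, fs=target_fs)
                resampled = filtfilt(b, a, resampled)  # zero-phase notch filter
        return resampled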
Before the official phase, we intentionally developed a classifier designed to identify the source databases as part of adversarial attacks on our validation and test data, to help ensure that teams cannot exploit these features to perform well on those sets. We encourage each team to consider a similar exercise to understand whether it is explicitly or implicitly using noncausally correlated features, especially if its official phase entries perform much better on the (cross-validated) training set than on the validation set.
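For instance, a team could run a check along the following lines (a minimal sketch, not our actual adversarial setup; X and db_labels are placeholders for a team's own feature matrix and integer-coded source-database labels): cross-validate a simple classifier that tries to predict the source database from the same features used for the task, and compare its accuracy with the majority-class baseline.

    # A minimal sketch of a leakage check (not the organizers' actual setup):
    # try to predict the *source database* from the features used for the task.
    # `X` is a (records x features) array; `db_labels` are integer database codes.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def database_leakage_check(X, db_labels, folds=5):
        clf = LogisticRegression(max_iter=1000)
        accuracy = cross_val_score(clf, X, db_labels, cv=folds).mean()
        baseline = np.bincount(db_labels).max() / len(db_labels)  # majority class
        return accuracy, baseline

Accuracy well above the baseline suggests that the features, and therefore any downstream model, are at least partly encoding database identity.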
We have intentionally designed the Challenge to highlight such issues. Many publications combine different databases without properly considering their differences, limiting generalizability. We think that preprocessing is as important as, and often more important than, the downstream processing, especially in an era of large databases and complex machine learning models that need and deserve proper scrutiny to be understood and used well.
All the best,
Gari, Matt, Reza, and the Challenge team.
Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge sent to any other address. This email address is maintained by a group; please do not email us individually.