about test set


can cao

Jun 18, 2020, 3:30:39 PM
to physionet-challenges
Dear Organizers,

I found a lot of data with two labels in PTB-XL, for example: HR00172 (SNR and CRBBB), HR00175 (MI and SNR), and so on. I think if the data was labeled as SNR, it should not also be labeled as CRBBB or MI; and if the data was labeled as CRBBB or MI, the SNR likelihood is usually set to 0 in ptbxl_database.csv (https://physionet.org/content/ptb-xl/1.0.1/). If I am right, there are many errors or debatable labels in PTB-XL.
My question is: are there many debatable labels in the test set as well?


Best,
Cao

PhysioNet Challenge

Jun 18, 2020, 3:53:17 PM
to can cao, physionet-challenges
Dear Cao, Competitors,

This is a really fun and important question. Thanks for asking this - it highlights just how important it is that you go back to the original data and read about how they were collected, processed, classified and labelled/formatted. 

To summarize this long email - it is so important that you go back to the source data and look at how it is generated, and be appropriately skeptical of any dataset you are given. 
In this case you have misinterpreted 0.0 as zero likelihood - which is not what the documentation of that data indicates. (Yes, NaN should have been used instead of 0.0, but that was not our choice - we did not create that dataset.) But let's back up to the larger question ...

First, you should be aware that there are few 'absolute' diagnoses in 12-lead ECGs (or any medical scenario, for that matter). Even within a group of 21 experts, each of whom had more than 20 years of ECG reading experience (and were internationally renowned experts), large discrepancies in the interpretation of the ECG can be found (see Bond et al. [1]). This is one reason it is called ECG 'interpretation'. There is inevitably something lost in translation, especially when you do not have the context. This doesn't mean making an algorithm isn't important ... all 12-lead ECG machines make first-pass diagnoses.

To answer the immediate question, please note that the original PTB-XL database lists both diagnoses and 'likelihoods' in the CSV annotation file, in the "scp_codes" column. Note also that the documentation indicates that a likelihood of 0.0 means the likelihood is "unknown" (not zero)!
In effect, when it takes a value of 0.0, it becomes a confidence in the likelihood. I know - don't ask why they did this - perhaps they used a tool that couldn't store anything but floats.
For this reason, we have not removed such diagnoses. There is evidence and logic to back this up. First, the logic: why would you label a diagnosis as having zero probability unless you were going to list the full set of zero-probability diagnoses? It seems arbitrary to choose just one. Second, if you examine the free-text field, it seems to agree that the diagnosis with 0.0 likelihood is in fact a real diagnosis.

Subject ID 77 has a report of "sinusrhythmus unvollständiger rechtsschenkelblock", which translates as "sinus rhythm, incomplete right bundle branch block", and the SCP codes = {'AMI': 50.0, 'IRBBB': 100.0, 'SR': 0.0}. You can certainly be in sinus rhythm and also have an incomplete RBBB. So you can see that the 0.0 doesn't reflect a zero likelihood that the patient is in sinus rhythm - it's just that the annotators are not sure how confident they are that the patient is in sinus rhythm, and feel you could (potentially) argue it either way. Their confidence in their confidence is zero! Why they chose such a confusing 'flag' for this, I don't know, but I would suggest you interpret this as 'NaN' - not a number.
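For example, here is a minimal Python sketch of how you might load the annotations and treat the 0.0 'likelihoods' as unknown rather than zero. The file and column names come from the PTB-XL documentation; reading 0.0 as NaN is our suggested interpretation, not something encoded in the dataset itself, and the lookup by 77 assumes the ID above refers to the ecg_id column:

    import ast
    import numpy as np
    import pandas as pd

    # Load the PTB-XL annotation file (adjust the path to your local copy).
    df = pd.read_csv("ptbxl_database.csv", index_col="ecg_id")

    # scp_codes stores a dict literal per record,
    # e.g. "{'AMI': 50.0, 'IRBBB': 100.0, 'SR': 0.0}".
    df["scp_codes"] = df["scp_codes"].apply(ast.literal_eval)

    def mark_unknown_likelihoods(codes):
        # Per the PTB-XL docs, a likelihood of 0.0 means "unknown",
        # not that the diagnosis has zero probability.
        return {dx: (np.nan if lik == 0.0 else lik) for dx, lik in codes.items()}

    df["scp_codes"] = df["scp_codes"].apply(mark_unknown_likelihoods)

    # Every key remains a valid diagnostic statement, including those
    # whose likelihood is now NaN (unknown).
    print(df.loc[77, "scp_codes"])

The point of the NaN conversion is that downstream code then cannot silently treat "unknown confidence" as "impossible diagnosis" when, say, thresholding or weighting labels.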

With respect to the quality of the test set - as you would expect, there are no patients in the test set that are present in the training set (to our knowledge - or at least we judge the probability to be very low). There are multiple test databases. At least one is drawn from the same source as one of the training data sets, and at least one is completely original and independent of the training set. None of the test data has been posted publicly before, and all of it was collected (and over-read by at least one cardiologist) during routine diagnosis in the clinic. The labels are therefore as good as we can expect in standard clinical practice, but no better. For this reason, we generally do not penalize labels that involve blood work and other additional confirmatory procedures. (See the upcoming scoring matrix.)

This is really the most important point of the competition. You must learn how to deal with noisy labels in big data!

There is another key issue here: Each database is labelled using a different ontology, or subset of terms in an ontology (or sometimes no ontology - just free text). We therefore had to make a call about how to map these. For example, we have the following four labels for ventricular ectopic beats:
 
Description, SNOMED Code, Abbreviation
premature ventricular complexes, 164884008, PVC
premature ventricular contractions, 427172004, PVC
ventricular ectopic beats, 17338001, VEB
ventricular premature beat, 17338001, VPB

You'll note that while we have chosen to retain the distinction between these in terms of SNOMED codes (although we have merged the two PVC entries, because we could see no reason they had two separate codes), in the scoring matrix they carry the same weight, and mixing them up doesn't cost you any points. You may then ask, 'Why not merge them all in the labelling?' Well, that's a question you have to answer for yourself. You are certainly welcome to do that - but you may not want to. You may note that only VPB indicates the temporal location of the beat relative to the preceding normal beat. This may, or may not, affect your algorithm, depending on how you write your code. You may or may not want it to affect your algorithm - the relative timing of beats certainly gives you information! We have therefore tried to provide you with as much useful information as possible, without overwhelming you with a complete data dump.
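To make the merging concrete, here is a minimal sketch of collapsing scoring-equivalent codes into a single class. The grouping mirrors the table above, but the merged class name, the collapse_labels helper, and the use of 426783006 to stand for sinus rhythm in the example are our illustrative choices, not an official Challenge mapping file:

    # Scoring-equivalent SNOMED codes for ventricular ectopy, per the
    # table above. This grouping is illustrative, not an official file.
    EQUIVALENT_CODES = {
        "164884008": "PVC/VEB/VPB",  # premature ventricular complexes
        "427172004": "PVC/VEB/VPB",  # premature ventricular contractions
        "17338001": "PVC/VEB/VPB",   # ventricular ectopic/premature beats
    }

    def collapse_labels(snomed_codes):
        # Map each code onto its merged class; codes without an
        # equivalence group stay as their own class.
        return sorted({EQUIVALENT_CODES.get(code, code) for code in snomed_codes})

    # These two label sets become identical after collapsing, so
    # confusing VEB with a PVC costs nothing in the scoring matrix.
    print(collapse_labels(["17338001", "426783006"]))
    print(collapse_labels(["164884008", "426783006"]))

Whether you train on the merged classes or the original codes is exactly the design choice discussed above: merging discards the temporal nuance carried by VPB, while keeping the codes separate leaves that information available to your model.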

Hopefully this information is beginning to reveal just how complex and difficult this problem can be. The PhysioNet/CinC Challenges are not ordinary public competitions. We call it a 'Challenge' for a reason - we are working together to discover best practices, highlight the problems in the field, and hopefully present solutions. This is not an artificial, potted dataset on which the metric encourages you to overfit. It is a real collection of data, with real-world issues. The subjectivity and degeneracy of medical labels are something you need to think about closely, and perhaps incorporate into your classifiers.

Thank you again for the question - it is these types of discussions that make the Challenge a productive contribution to the scientific community, and are as important as the final results. 

One final note - please forgive us if we take time to respond to questions. Like many of you, we have young children at home with us, have many other projects that require our attention (including projects related to COVID-19) and are also dealing with family and friends' illnesses. Today's response was rapid because it sparked a very important and deep issue I felt needed clarification.

Good luck and keep the questions coming! (We'll do our best to get to them.)

-Gari
 

[1] Bond RR, Zhu T, Finlay DD, Drew B, Kligfield PD, Guldenring D, Breen C, Gallagher AG, Daly MJ, Clifford GD. Assessing computerized eye tracking technology for gaining insight into expert interpretation of the 12-lead electrocardiogram: an objective quantitative approach. J Electrocardiol. 2014 Nov-Dec;47(6):895-906.



