Significant result from just 2 bioreplicates

Alessandro Caioli

Feb 18, 2025, 3:25:02 PM

to MSstats

Hi! I have a couple of questions about differential analysis with MSstats.

I have an experiment with 13 bioreplicates, 2 technical replicates, 2 conditions and 8 fractions. When I run the model I obtain one significant protein, which does have a body of supporting literature for the specific issue we are investigating. The problem is that when I looked deeper into the FeatureLevelData and the profile plot for that protein (attached as a png file), I found that only two bioreplicates (one per group) actually express it. All the other bioreplicates have either low-quality features for that protein (which are removed with featureSubset = "topN") or missing values that cannot be imputed by the AFT model.

I have two questions about this situation:

1. Am I correct in assuming that, despite the very low adjusted p-value, I cannot trust this result because it comes from comparing just two bioreplicates?

2. Since I have a lot of missing values that cannot be imputed with the AFT model, is there any way I could use other strategies (such as MinDet, MinProb, etc.) on the remaining missing values? For example, I was thinking of imputing values in FeatureLevelData wherever censored is TRUE and newABUNDANCE is missing, but I guess I would then have to rerun the TMP portion of dataProcess on the new FeatureLevelData to obtain the ProteinLevelData. Is that possible, and/or is it a terrible idea?

Thanks for the help as always,

Alessandro

Screenshot 2025-02-18 at 21.19.27.png

Anthony Wu

Feb 27, 2025, 7:08:19 PM

to MSstats

Hi,

Yes, I agree with your assessment: you should be careful with this result despite the low adjusted p-value, since only 2 bioreplicates (one per group) contribute data for this protein.
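A quick way to spot this situation for every protein, not just one, is to count how many bioreplicates per condition have at least one observed value. The sketch below is only an illustration in plain Python, not MSstats code: the rows, the tuple layout, and the function names are made up, and in practice you would build the rows from your own FeatureLevelData export.

```python
from collections import defaultdict

# Hypothetical rows: (protein, condition, bioreplicate, log2 abundance or None).
# None stands for a value that is missing and was not imputed.
rows = [
    ("P1", "A", "s1", 21.3),
    ("P1", "A", "s1", 20.9),
    ("P1", "A", "s2", None),
    ("P1", "B", "s8", 24.1),
    ("P1", "B", "s9", None),
    ("P2", "A", "s1", 18.0),
    ("P2", "A", "s2", 18.4),
    ("P2", "B", "s8", 19.1),
    ("P2", "B", "s9", 19.0),
]


def observed_bioreplicates(rows):
    """Map (protein, condition) to the set of bioreplicates with an observed value."""
    seen = defaultdict(set)
    for protein, condition, subject, abundance in rows:
        if abundance is not None:
            seen[(protein, condition)].add(subject)
    return seen


def weakly_supported(rows, min_reps=2):
    """List proteins where any condition has fewer than min_reps observed bioreplicates."""
    seen = observed_bioreplicates(rows)
    proteins = sorted({r[0] for r in rows})
    conditions = sorted({r[1] for r in rows})
    flagged = []
    for protein in proteins:
        counts = {c: len(seen.get((protein, c), set())) for c in conditions}
        if min(counts.values()) < min_reps:
            flagged.append((protein, counts))
    return flagged


flagged = weakly_supported(rows)
for protein, counts in flagged:
    print(protein, counts)
```

Any protein this flags deserves a look at its profile plot before its adjusted p-value is taken at face value. The threshold min_reps=2 is an arbitrary choice for the example.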

Regarding imputation, it is possible to impute with another strategy and then rerun TMP (you can do this by running each subfunction of dataProcess separately, see the example at this url). However, I'd advise against it. Since only two bioreplicates have data for this protein, any alternative imputation strategy is likely to overfit to the specific abundance values of those 2 bioreplicates, which makes the imputed values unreliable.
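To make the overfitting concern concrete, here is a minimal sketch of a deterministic-minimum fill applied per protein and per group. It is a deliberately simplified stand-in for a MinDet-style strategy, not the implementation from any imputation package, and the function name and the numbers are invented for the example.

```python
import statistics


def fill_with_low_quantile(values, q=0.01):
    """Replace None with a low quantile of the observed values (simplified, hypothetical)."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to derive a fill value from")
    # With very few observations this index is 0, so the fill value is the minimum.
    fill = observed[int(q * (len(observed) - 1))]
    return [fill if v is None else v for v in values]


# Hypothetical log2 abundances for one protein: one observed bioreplicate per group.
group_a = [21.3, None, None, None]
group_b = [24.1, None, None, None]

imputed_a = fill_with_low_quantile(group_a)
imputed_b = fill_with_low_quantile(group_b)

# Every imputed value is a copy of the single observed one, so the within-group
# variance collapses to zero and the group difference looks perfectly consistent.
print(imputed_a, statistics.pvariance(imputed_a))
print(imputed_b, statistics.pvariance(imputed_b))
print(statistics.mean(imputed_b) - statistics.mean(imputed_a))
```

In this toy case the downstream test would see four identical values per group, which overstates the evidence. A fill value taken from a whole run rather than a single protein would behave differently, but it would still be standing in for 12 of 13 measurements per group.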

Instead, I'd investigate why those values are missing for that specific protein. If you expect that protein to be present based on the supporting literature, it's somewhat surprising that it's missing in almost all runs. I can think of two possibilities here:

1. There is a true biological explanation for why all other bioreplicates have a low abundance of that protein.

2. There was a processing issue in your peptide ID / quant software, e.g. the spectra produced by that protein's peptides were misassigned to a different protein's peptides.

Thanks,

Tony

Alessandro Caioli

Apr 3, 2025, 3:12:23 AM

to MSstats

Hi Tony,

I just wanted to thank you very much for your reply and feedback. Somehow I missed it and only noticed it now.