Hi all,
I'm running MSstats v4.18.1 on a label-free DIA dataset (Spectronaut output) from an ultra-low input laser capture microdissection (LCM) experiment profiling the astrocyte niche around amyloid plaques in a mouse model. I'm comparing three conditions: control, plaque_near, and plaque_far. I have biological replicates with multiple technical replicate runs per biological replicate. Given the ultra-low input nature of LCM, missingness is a particular concern in this dataset.
Before dataProcess, I applied a custom feature-level filter requiring ≥50% observation in at least one condition. Within dataProcess itself, I used featureSubset = "highQuality" with remove_uninformative_feature_outlier = TRUE.
My dataProcess call was:
    processed_data <- dataProcess(
        msstats_input_filtered_keratin,
        normalization = "equalizeMedians",
        summaryMethod = "TMP",
        censoredInt = "NA",
        MBimpute = FALSE,
        featureSubset = "highQuality",
        remove_uninformative_feature_outlier = TRUE
    )
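To make the filter rule concrete, here is a minimal base-R sketch of a "≥50% observed in at least one condition" filter on a toy table. It is not the exact code used for this dataset: the `Feature` column, the feature names, and all values are made up for illustration, and a real MSstats input would identify a feature by its peptide, charge, and fragment columns instead.

```r
# Toy long-format table: 2 hypothetical features x 6 runs, 3 runs per condition.
toy <- expand.grid(
  Feature = c("pepA_2_y5", "pepB_2_y7"),
  Run = paste0("run", 1:6),
  stringsAsFactors = FALSE
)
toy$Condition <- ifelse(toy$Run %in% paste0("run", 1:3), "control", "plaque_near")
toy$Intensity <- 1000

# pepA: observed in 2 of 3 control runs and 0 of 3 plaque_near runs.
toy$Intensity[toy$Feature == "pepA_2_y5" &
              toy$Run %in% c("run3", "run4", "run5", "run6")] <- NA
# pepB: observed in 1 of 3 runs in each condition.
toy$Intensity[toy$Feature == "pepB_2_y7" &
              !(toy$Run %in% c("run1", "run4"))] <- NA

# Fraction of runs with an observed intensity, per feature (rows) and condition (columns).
obs_frac <- tapply(!is.na(toy$Intensity),
                   list(toy$Feature, toy$Condition),
                   mean)

# Keep a feature if it reaches 50% observation in at least one condition.
keep <- rownames(obs_frac)[apply(obs_frac, 1, max) >= 0.5]
filtered <- toy[toy$Feature %in% keep, ]
```

In this toy case only pepA_2_y5 passes, because it reaches 2/3 observation in control, while pepB_2_y7 never exceeds 1/3 in either condition.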
I have two concerns:
1. How does the floor replacement actually work in practice?
The documentation states that censoredInt = "NA" with MBimpute = FALSE replaces censored NAs with the cutoffCensored value (minimum observed intensity per feature). My concern is that if multiple NAs within a condition are replaced with the same floor value, this injects identical data points into the TMP summarisation, which could artificially reduce within-group variance at the protein level. A smaller variance would shrink the standard error in groupComparison, potentially producing overconfident t-statistics and false positives. Is this a valid concern, or does TMP's median-based summarisation handle this robustly?
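The arithmetic behind this concern can be sketched on a single feature in one condition. This is only an illustration with made-up log2 intensities and a hypothetical `floor_fill` helper; it does not reproduce what MSstats does internally (no TMP, no run-level summarisation), and the floor values are assumptions chosen to show two different cases.

```r
# Hypothetical helper: replace missing values with a fixed floor value.
floor_fill <- function(x, floor_value) {
  x[is.na(x)] <- floor_value
  x
}

# Made-up log2 intensities for one feature in one condition, 2 of 4 runs missing.
grp <- c(20.1, 19.7, NA, NA)

# Variance using only the observed values.
var_observed <- var(grp, na.rm = TRUE)

# Case 1: the feature-wide minimum happens to sit inside this group's range.
var_floor_near <- var(floor_fill(grp, 19.7))

# Case 2: the feature-wide minimum comes from a much lower run elsewhere.
var_floor_far <- var(floor_fill(grp, 15.0))

c(observed = var_observed, floor_near = var_floor_near, floor_far = var_floor_far)
```

In this toy example the repeated floor value shrinks the variance when the floor lies close to the observed values, but inflates it when the floor lies well below them, so the direction of the effect depends on where the per-feature minimum sits relative to each group.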
I initially tried MBimpute = TRUE (table attached), but this produced far more differentially abundant proteins (DAPs) than seemed biologically reasonable for this experiment.
MBimpute = FALSE gave more conservative and plausible results (table attached), which is why I opted for this approach. However, I want to make sure the floor replacement isn't introducing a subtler version of the same problem.
As you can see from the table, MSstats picks up far more significant proteins for plaque_far vs control than limma does. Plots of the raw values suggest that MSstats identifies these DAPs better than limma, and further analysis has demonstrated that these proteins do seem to be consistently
2. Why does censoredInt = NULL produce identical results to censoredInt = "NA"?
To test whether the floor replacement was affecting my results, I reran dataProcess with censoredInt = NULL and MBimpute = FALSE, which should treat all NAs as randomly missing with no replacement. The output was identical to censoredInt = "NA".
Initially I assumed this was because my upstream 50% observation filter removed most sparse features, leaving very few NAs. But if that's the case, do featureSubset = "highQuality" and remove_uninformative_feature_outlier = TRUE already handle missingness aggressively enough that the censoredInt setting becomes redundant? Or is there another reason these two settings would produce the same result?
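One way to pin down what "identical" means here is to compare the protein-level tables from the two runs row by row. The sketch below uses toy data frames and a hypothetical helper, `compare_protein_level`. The column names `Protein`, `RUN`, and `LogIntensities` are assumed to match the ProteinLevelData table returned by dataProcess and should be checked against the actual output.

```r
# Hypothetical helper: align two protein-level tables and compare their values.
compare_protein_level <- function(a, b, tol = 1e-8) {
  # Sort both tables by protein and run so that row order does not matter.
  a <- a[order(a$Protein, a$RUN), ]
  b <- b[order(b$Protein, b$RUN), ]
  same_keys <- identical(a$Protein, b$Protein) && identical(a$RUN, b$RUN)
  max_diff <- if (same_keys) {
    max(abs(a$LogIntensities - b$LogIntensities), na.rm = TRUE)
  } else {
    NA_real_
  }
  list(
    same_keys = same_keys,
    max_abs_diff = max_diff,
    identical_within_tol = same_keys && max_diff <= tol
  )
}

# Toy stand-ins for the two outputs (values are made up).
run_na <- data.frame(
  Protein = c("P1", "P1", "P2", "P2"),
  RUN = c(1, 2, 1, 2),
  LogIntensities = c(20.1, 19.8, 18.2, 18.5)
)
run_null <- run_na[c(3, 1, 4, 2), ]  # same values, different row order

# A third table with one perturbed value, to show a detected difference.
run_shifted <- run_na
run_shifted$LogIntensities[1] <- run_shifted$LogIntensities[1] + 0.5

res_same <- compare_protein_level(run_na, run_null)
res_diff <- compare_protein_level(run_na, run_shifted)
```

Counting the remaining NA intensities in the filtered input (for example with `sum(is.na(...))` on the intensity column) alongside such a comparison would show whether there were any missing values left for the censoredInt setting to act on.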
Happy to share more details about the experimental design / my code if useful, and I hope my confusion made sense.
Thanks!
Carlo