Issues in MSstatsPTM::dataSummarizationPTM_TMT

Sam Siljee

Feb 27, 2025, 2:57:36 PM
to MSstats
Hi MSstats team,

I'm running into a problem part-way through the data summarization step of my phosphoproteomics data.
The protein-level summarization appears to complete just fine for the phosphoproteomics data, and the first 4 replicates of the proteomics data, before crashing halfway through the fifth replicate.
It gives the following error message:

Warning in dcast.data.table(LABEL + RUN ~ FEATURE, data = input, value.var = "newABUNDANCE", : 'fun.aggregate' is NULL, but found duplicate row/column combinations, so defaulting to length(). That is, the variables [LABEL, RUN, FEATURE] used in 'formula' do not uniquely identify rows in the input 'data'. In such cases, 'fun.aggregate' is used to derive a single representative value for each combination in the output data.table, for example by summing or averaging (fun.aggregate=sum or fun.aggregate=mean, respectively). Check the resulting table for values larger than 1 to see which combinations were not unique. See ?dcast.data.table for more details.

<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>

Warning in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], : Input data.table 'y' has no columns.

Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], : The following columns listed in `by` are missing from y: [FEATURE, RUN, cen]
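The first warning says that at least one combination of LABEL, RUN and FEATURE appears on more than one row. A minimal base-R sketch for listing those rows is below. The column names come from the warning itself; the toy table and its values are hypothetical and only stand in for the real input.

```r
# Hypothetical toy table; only the column names come from the warning text.
input <- data.frame(
  LABEL = c("L", "L", "L", "L"),
  RUN = c("run1", "run1", "run1", "run2"),
  FEATURE = c("pepA_3", "pepA_3", "pepB_2", "pepA_3"),
  newABUNDANCE = c(10.2, 11.5, 9.8, 10.0)
)

# The columns that should uniquely identify a row.
keys <- c("LABEL", "RUN", "FEATURE")

# Flag every row whose key combination occurs more than once
# (scanning from both ends so the first occurrence is flagged too).
is_dup <- duplicated(input[keys]) | duplicated(input[keys], fromLast = TRUE)

# These are the rows that cannot be placed into a single cell when reshaping.
dup_rows <- input[is_dup, ]
print(dup_rows)
```

If `dup_rows` is empty, the key columns identify rows uniquely and this particular warning should not be triggered by duplicates.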

These are the same PSM and annotation files I used for TMT proteomics using `MSstatsTMT`, with no problem.
I'm using R 4.4.2, MSstatsPTM 2.8.1, MSstats 4.14.1.
I'll see if I can subset and anonymize the data to upload as an example.

Many thanks,

Sam

Sam Siljee

Feb 27, 2025, 8:42:46 PM
to MSstats
Here is a link to my data (peptide sequences and protein names randomised).
I've also reduced the proteomics data to only include the technical replicate that was causing the function to error.


Sam

Sam Siljee

Mar 2, 2025, 5:27:30 PM
to MSstats
After some trial and error, I have narrowed the issue down to two PSMs from the same run in the input. If both are included, the data processing fails; if either one is filtered out first, the data processes successfully.
I've attached a .csv of the two PSMs causing trouble.
I also suspect that the problem starts earlier, as I hadn't noticed that warnings from the `MSstatsPTM::PDtoMSstatsPTMFormat` call were being suppressed. The warning message is: "In `[.data.table`(input_duplicates, , `:=`(keep, .summarizeMultiplePSMs(.SD, ... : Coercing 'character' RHS to 'logical' to match the type of target vector."
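One way to avoid missing warnings like this is to collect them explicitly around the call. The sketch below uses only base R; `collect_warnings` and `noisy_step` are hypothetical names, and `noisy_step` is just a stand-in for the real converter call.

```r
# Evaluate an expression and record every warning message it raises,
# instead of letting the warnings scroll past or be suppressed.
collect_warnings <- function(expr) {
  msgs <- character(0)
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      # Store the message, then resume evaluation without printing it.
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(value = value, warnings = msgs)
}

# Hypothetical stand-in for a conversion step that warns but still returns.
noisy_step <- function() {
  warning("Coercing 'character' RHS to 'logical'")
  "converted"
}

res <- collect_warnings(noisy_step())
print(res$warnings)
```

The result of the wrapped call is still available in `res$value`, so the rest of the pipeline can continue while the warnings are inspected separately.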

For now my solution is to manually remove one of the PSMs before pre-processing for input.

Sam
Error_causing_PSMs.csv

Anthony Wu

Mar 4, 2025, 6:48:31 PM
to MSstats
Hi,

Your assessment is correct: the data processing is failing because there are duplicate PSMs. As far as I know, duplicate PSMs should be summarized at the converter step, so there may be a software bug in the converter function.

To speed up the investigation, would you be able to provide a subset of the failing PD-formatted dataset, along with the annotation file? Instead of the 2 rows provided in the previous response, all of the rows for protein Q03252 would be better.

Thanks,
Tony

Sam Siljee

Mar 5, 2025, 3:44:24 PM
to MSstats
Hi Tony,

I've filtered the proteomics PSMs to only include Q03252 and uploaded the input processed by `MSstatsPTM::PDtoMSstatsPTMFormat` (fails with `MSstatsPTM::dataSummarizationPTM_TMT`)
and the same input processed by `MSstatsTMT::PDtoMSstatsTMTFormat` (runs successfully with `MSstatsTMT::proteinSummarization`).

Please let me know if there's anything else you need!

Sam
PTM_preprocessed.rda
annotations.csv
TMT_preprocessed.rda

Anthony Wu

Mar 12, 2025, 3:06:09 PM
to MSstats
Hi,

I took a look and I can confirm there is an issue with duplicate PSMs for the same run, notably PSM [R].eNEnGEEEEEEAEFGEEDLFHQQGDPR.[T]_3

Would you be able to provide the data file output from PD (i.e. NOT the MSstats input table)? I would like to run the PD data file through the `MSstatsPTM::PDtoMSstatsPTMFormat` function myself to see if there is a bug in the converter function.

Thanks,
Tony

Anthony Wu

Mar 20, 2025, 12:05:01 PM
to MSstats
Hi,

Further update - I've determined that the duplicates are coming from a possible bug in the `summaryforMultipleRows` parameter when it is set to `max`.  For now, set that parameter to `sum` and protein summarization should work as expected.

e.g.
pd_imported = PDtoMSstatsPTMFormat(
  input, annotation = annot,
  protein_input = input_protein, annotation_protein = annot_protein,
  summaryforMultipleRows = sum,
  fasta_path = fasta_path, mod_id = "\\(GG\\)", labeling_type = "TMT"
)

`PDtoMSstatsTMTFormat` did not create duplicate entries because its default for `summaryforMultipleRows` is `sum`, whereas the default in `PDtoMSstatsPTMFormat` is `max`. I'll need to investigate further to better understand these differences and fix the bug in the code.
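To illustrate what collapsing multiple rows into one value per feature and run is expected to produce, here is a small base-R sketch. It does not reproduce the converter's internals; the table, column names and intensities are hypothetical. Either choice of function should leave exactly one row per feature and run, which is the property the summarization step relies on.

```r
# Hypothetical PSM-level table with one duplicated feature in the same run.
psms <- data.frame(
  Feature = c("pepA_3", "pepA_3", "pepB_2"),
  Run = c("run1", "run1", "run1"),
  Intensity = c(100, 250, 80)
)

# Collapse duplicates by adding their intensities together.
by_sum <- aggregate(Intensity ~ Feature + Run, data = psms, FUN = sum)

# Collapse duplicates by keeping the largest intensity.
by_max <- aggregate(Intensity ~ Feature + Run, data = psms, FUN = max)

print(by_sum)
print(by_max)

# Sanity check: after collapsing, no (Feature, Run) pair should repeat.
stopifnot(!anyDuplicated(by_sum[c("Feature", "Run")]))
stopifnot(!anyDuplicated(by_max[c("Feature", "Run")]))
```

Running the same kind of uniqueness check on the converter output is a quick way to confirm whether duplicates survived the conversion step.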

Thanks,
Tony

Sam Siljee

Mar 31, 2025, 11:04:00 PM
to MSstats
Hi Tony,

I've been away on leave, so sorry for the delayed response.
The solution of removing the one PSM causing trouble is satisfactory for me.
Here's a link to the output from PD (It's actually from my standard MSstatsTMT analysis, but they should be identical).

Let me know if you need anything else!

Best regards,

Sam