Issues in MSstatsPTM::dataSummarizationPTM_TMT

Sam Siljee

Feb 27, 2025, 2:57:36 PM

to MSstats

Hi MSstats team,

I'm running into a problem part-way through the data summarization step of my phosphoproteomics data.

The protein-level summarization appears to complete just fine for the phosphoproteomics data, and the first 4 replicates of the proteomics data, before crashing halfway through the fifth replicate.

It gives the following error message:

Warning in dcast.data.table(LABEL + RUN ~ FEATURE, data = input, value.var = "newABUNDANCE", : 'fun.aggregate' is NULL, but found duplicate row/column combinations, so defaulting to length(). That is, the variables [LABEL, RUN, FEATURE] used in 'formula' do not uniquely identify rows in the input 'data'. In such cases, 'fun.aggregate' is used to derive a single representative value for each combination in the output data.table, for example by summing or averaging (fun.aggregate=sum or fun.aggregate=mean, respectively). Check the resulting table for values larger than 1 to see which combinations were not unique. See ?dcast.data.table for more details.

<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>

Warning in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], : Input data.table 'y' has no columns.

Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], : The following columns listed in `by` are missing from y: [FEATURE, RUN, cen]

These are the same PSM and annotation files I used for TMT proteomics using `MSstatsTMT`, with no problem.

I'm using R 4.4.2, MSstatsPTM 2.8.1, and MSstats 4.14.1.

I'll see if I can subset and anonymize the data to upload as an example.

Many thanks,

Sam

Sam Siljee

Feb 27, 2025, 8:42:46 PM

to MSstats

Here is a link to my data (peptide sequences and protein names randomised).

I've also reduced the proteomics data to only include the technical replicate that was causing the function to error.

https://drive.google.com/file/d/1HsxHG7bnqe_4uTZ5SdEV3w-p9W312RDX/view?usp=sharing

Sam

Sam Siljee

Mar 2, 2025, 5:27:30 PM

to MSstats

After some trial and error, I have narrowed the issue down to two PSMs from the same run in the input. If both are included, the data processing fails; if either one is filtered out first, the data processes successfully.

I've attached a .csv of the two PSMs causing trouble.
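For anyone wanting to check their own data for the same situation, here is a minimal base R sketch of how rows sharing the same PSM, run and channel could be flagged. The toy table and its column names (`PSM`, `Run`, `Channel`, `Intensity`) are assumptions made for illustration; they are not taken from the attached files and may not match the actual converter output.

```r
# Toy stand-in for a converted PSM table.
# Column names and values are hypothetical, chosen only for illustration.
psms <- data.frame(
  PSM       = c("PEPTIDEA_2", "PEPTIDEB_3", "PEPTIDEB_3", "PEPTIDEC_2"),
  Run       = c("run5", "run5", "run5", "run5"),
  Channel   = c("126", "126", "126", "126"),
  Intensity = c(1200, 800, 950, 400),
  stringsAsFactors = FALSE
)

# Columns that are expected to identify a measurement uniquely (assumed).
key_cols <- c("PSM", "Run", "Channel")

# Flag every row whose key occurs more than once, so that all members of a
# duplicated group are reported rather than only the later copies.
is_dup <- duplicated(psms[key_cols]) |
  duplicated(psms[key_cols], fromLast = TRUE)

duplicate_rows <- psms[is_dup, ]
print(duplicate_rows)
```

If `duplicate_rows` is empty, the key columns identify each row uniquely; otherwise it lists the candidates to inspect.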

I also suspect that the problem may be occurring earlier in the pipeline, as I hadn't noticed that the `MSstatsPTM::PDtoMSstatsPTMFormat` call was producing suppressed warnings. The warning messages are: "In `[.data.table`(input_duplicates, , `:=`(keep, .summarizeMultiplePSMs(.SD, ... : Coercing 'character' RHS to 'logical' to match the type of target vector."

For now my solution is to manually remove one of the PSMs before pre-processing for input.
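A rough sketch of what such a manual removal might look like in base R is below. The table, the column names (`PSM`, `Run`, `Intensity`) and the choice of which copy to drop are all hypothetical; in real Proteome Discoverer output the relevant columns are likely named differently.

```r
# Hypothetical PSM table containing one duplicated PSM within a single run.
psm_table <- data.frame(
  PSM       = c("PEPTIDEA_2", "PEPTIDEB_3", "PEPTIDEB_3"),
  Run       = c("run5", "run5", "run5"),
  Intensity = c(1200, 800, 950),
  stringsAsFactors = FALSE
)

# Mark the single row to drop. Here the lower-intensity copy is removed,
# which is an arbitrary choice made for this illustration.
drop_row <- psm_table$PSM == "PEPTIDEB_3" &
  psm_table$Run == "run5" &
  psm_table$Intensity == 800

filtered <- psm_table[!drop_row, ]

# Confirm that each PSM now appears at most once per run.
stopifnot(anyDuplicated(filtered[c("PSM", "Run")]) == 0)
print(filtered)
```

The filtered table would then be passed to the converter in place of the original input.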

Sam

Error_causing_PSMs.csv

Anthony Wu

Mar 4, 2025, 6:48:31 PM

to MSstats

Hi,

Your assessment is correct: the data processing is failing because there are duplicate PSMs. As far as I know, duplicate PSMs should be summarized at the converter step, so there may be a software bug in the converter function.

To speed up the investigation, would you be able to provide a subset of the PD-formatted dataset that is failing, along with the annotation file? Instead of the 2 rows provided in the previous response, all of the rows for protein Q03252 would be better.

Thanks,

Tony

Sam Siljee

Mar 5, 2025, 3:44:24 PM

to MSstats

Hi Tony,

I've filtered the proteomics PSMs to only include Q03252 and uploaded the input processed by `MSstatsPTM::PDtoMSstatsPTMFormat` (fails with `MSstatsPTM::dataSummarizationPTM_TMT`) and the same input processed by `MSstatsTMT::PDtoMSstatsTMTFormat` (runs successfully with `MSstatsTMT::proteinSummarization`).

Please let me know if there's anything else you need!

Sam

PTM_preprocessed.rda

annotations.csv

TMT_preprocessed.rda

Anthony Wu

Mar 12, 2025, 3:06:09 PM

to MSstats

Hi,

I took a look and can confirm there is an issue with duplicate PSMs for the same run, notably PSM [R].eNEnGEEEEEEAEFGEEDLFHQQGDPR.[T]_3.

Would you be able to provide me with the data file output from PD (i.e. NOT THE MSSTATS_INPUT TABLE)? I would like to run the PD data file through the `MSstatsPTM::PDtoMSstatsPTMFormat` function myself to see if there is a bug in the converter function.

Thanks,

Tony

Anthony Wu

Mar 20, 2025, 12:05:01 PM

to MSstats

Hi,

Further update - I've determined that the duplicates are coming from a possible bug in the `summaryforMultipleRows` parameter when it is set to `max`. For now, set that parameter to `sum` and protein summarization should work as expected.

e.g.

pd_imported = PDtoMSstatsPTMFormat(
    input, annotation = annot,
    protein_input = input_protein, annotation_protein = annot_protein,
    summaryforMultipleRows = sum,  # workaround: use sum instead of max
    fasta_path = fasta_path, mod_id = "\\(GG\\)", labeling_type = "TMT"
)

`PDtoMSstatsTMTFormat` did not create duplicate entries because its default for `summaryforMultipleRows` is `sum`, whereas the default in `PDtoMSstatsPTMFormat` is `max`. I'll need to investigate further to understand these differences and fix the bug in the code.
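To illustrate what collapsing duplicate rows with `sum` versus `max` means in principle, here is a small base R sketch. It is not the MSstats converter's internal logic; the helper `collapse_psms`, the toy data and the column names are all hypothetical. It only shows that either function, applied correctly, should leave one row per PSM, run and channel, which is the condition the summarization step appears to rely on.

```r
# Toy table with one PSM measured twice in the same run and channel.
# Column names and values are assumptions made for illustration only.
dup <- data.frame(
  PSM       = c("PEPTIDEB_3", "PEPTIDEB_3", "PEPTIDEC_2"),
  Run       = c("run5", "run5", "run5"),
  Channel   = c("126", "126", "126"),
  Intensity = c(800, 950, 400),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse rows sharing the same key into a single
# row, combining their intensities with the supplied function.
collapse_psms <- function(data, fun) {
  aggregate(Intensity ~ PSM + Run + Channel, data = data, FUN = fun)
}

by_sum <- collapse_psms(dup, sum)  # duplicated intensities are added
by_max <- collapse_psms(dup, max)  # only the largest intensity is kept

print(by_sum)
print(by_max)
```

Both results contain one row per key, so the choice between `sum` and `max` changes the reported intensity but should not, by itself, leave duplicates behind.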

Thanks,

Tony

Sam Siljee

Mar 31, 2025, 11:04:00 PM

to MSstats

Hi Tony,

I've been away on leave, so sorry for the delayed response.

The solution of removing the one PSM causing trouble is satisfactory for me.

Here's a link to the output from PD (it's actually from my standard MSstatsTMT analysis, but the files should be identical).

https://drive.google.com/file/d/1mwPgjZloiRyNUSy0rzOczqWoDjsAxAVm/view?usp=sharing

Let me know if you need anything else!

Best regards,

Sam
