Input for getWeightedProteinSummary

Sam Siljee

Mar 9, 2025, 9:00:47 PM
to MSstats
Dear all,

I'm looking to integrate MSstatsWeightedSummary into my TMT and PTM (TMT) workflows; is this the right place to ask?
I'm getting stuck formatting the input and would appreciate some advice. I've written a function to take my PSMs (from PD) and annotations and format them for getWeightedProteinSummary. The only thing I haven't implemented yet is summarisation of fractions.
The clustering and normalization steps run as expected; it's just the summarisation that doesn't seem to run.
Using the PDtoMSstatsTMT function with the same annotations and PSMs works fine as input for getWeightedProteinSummary. My only problem with this is that it only uses the first protein ID for protein names, even if I set useUniquePeptide to FALSE and "Protein Accessions" as the protein identifier.
Is there something else I should be adding to my function, or other options I should be using in the PDtoMSstatsTMT call?

Thank you!

Sam

PDtoMSstatsWeightedSummary.R
PSM_subset.csv

Sam Siljee

Mar 9, 2025, 9:02:05 PM
to MSstats
Here is my current PDtoMSstatsTMTFormat call, which works but seems to ignore protein groups:

MSstatsTMT_input <- MSstatsTMT::PDtoMSstatsTMTFormat(
  input = PSMs,
  annotation = annotations,
  which.proteinid = "Protein Accessions",
  useNumProteinsColumn = TRUE,
  useUniquePeptide = FALSE,
  rmPSM_withfewMea_withinRun = FALSE, # Default: TRUE
  rmProtein_with1Feature = FALSE,
  summaryforMultipleRows = sum,
  use_log_file = TRUE,
  append = FALSE,
  verbose = TRUE,
  log_file_path = paste0(
    output_dir,
    "/logs/PDtoMSstatsTMTFormat_log_",
    format(Sys.time(), "%Y_%m_%d_%H_%M_%S"),
    ".txt"
  )
)

Sam Siljee

Mar 9, 2025, 9:05:15 PM
to MSstats
Here is the error and traceback from getWeightedProteinSummary, using the input formatted by my PDtoMSstatsWeightedSummary function.

Error:
Error in `[[<-.data.frame`(`*tmp*`, "intercept", value = 1) : replacement has 1 row, data has 0


Traceback:
13: stop(sprintf(ngettext(N, "replacement has %d row, data has %d", "replacement has %d rows, data has %d"), N, nrows), domain = NA)
12: `[[<-.data.frame`(`*tmp*`, "intercept", value = 1)
11: `[[<-`(`*tmp*`, "intercept", value = 1)
10: getProteinSummaryDesign(feature_data)
9: summarizeProteinsClusterSingleRun(input_loop, initial_weights, norm, norm_parameter, use_shared = FALSE)
8: getInitialSummary(input_loop, norm, norm_parameter, initial_summary)
7: getWeightedSummarySingleRun(x, peptide_protein_dt, norm, norm_parameter, weights_mode, tolerance, max_iter, initial_summary, weights_penalty, weights_penalty_param)
6: FUN(X[[i]], ...)
5: lapply(input_by_run, function(x) { getWeightedSummarySingleRun(x, peptide_protein_dt, norm, norm_parameter, weights_mode, tolerance, max_iter, initial_summary, weights_penalty, weights_penalty_param) })
4: FUN(X[[i]], ...)
3: lapply(cluster_input, function(single_cluster) { input_by_run = split(single_cluster, single_cluster[, Run]) peptide_protein_dt = unique(single_cluster[, .(ProteinName, PSM, Run)]) output_by_run = lapply(input_by_run, function(x) { getWeightedSummarySingleRun(x, peptide_protein_dt, norm, norm_parameter, weights_mode, tolerance, max_iter, initial_summary, weights_penalty, weights_penalty_param) }) summarized_output = data.table::rbindlist(lapply(output_by_run, function(x) x[["summary"]])) alphas_list = lapply(output_by_run, function(x) x[["alpha_history"]]) alpha_diffs = lapply(output_by_run, function(x) x[["convergence_history"]]) list(summary = summarized_output, pp_dt = peptide_protein_dt, alpha_history = alphas_list, convergence_history = alpha_diffs) })
2: getClusterSummaries(cluster_input, norm, norm_parameter, weights_mode, tolerance, max_iter, initial_summary, weights_penalty, weights_penalty_param)
1: getWeightedProteinSummary(MSstatsTMT_input, norm = "p_norm", norm_parameter = 1, weights_mode = "contributions", tolerance = 0.1, max_iter = 10, initial_summary = "unique", weights_penalty = FALSE, weights_penalty_param = 0.1, save_weights_history = FALSE, save_convergence_history = FALSE)


Sam Siljee

Mar 9, 2025, 10:57:04 PM
to MSstats
Sorry, I forgot to include my annotations, although there's only one run included in the test PSMs.
annotations.rda

Mateusz Staniak

Mar 10, 2025, 6:07:04 AM
to MSstats
Hi,

thanks for reaching out.
Regarding data pre-processing, there is no standard way to do this yet, as MSstats and most other tools have mainly used peptides uniquely assigned to protein labels generated by external tools. So you need a bit of extra work in addition to the MSstats converter.

Right now I don't have an example of processing data with fractions, but I do have an example for PD input: https://github.com/mstaniak/MWS_reproduction/blob/main/05_tpp_onepot_data_preparation.R
In principle, it should be the same as with unique peptides, except we need to:
- keep shared peptides, 
- decide on filtering (for example, MSstats has filters like "keep proteins with at least two features" - should it include shared peptides?),
- (if normalization is required) apply normalization from the MSstatsWeightedSummary package (it makes sure to use intensities of shared peptides only once),
- make sure there is a row for every match between a shared peptide and proteins that match to it.
Regarding the last part, it is a bit tricky in general (due to the protein grouping that some tools do), but when PD outputs all matching protein IDs, it can be done easily, as I did in the file I linked (lines 52-75).
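The row-expansion step could be sketched with data.table as below; the "Protein Accessions" column name follows the PD export discussed in this thread, and the toy values are purely illustrative:

```r
# Expand semicolon-separated protein accessions so that each shared
# peptide gets one row per matching protein.
# The "Protein Accessions" column name follows the PD export; adjust
# it to match your own output.
library(data.table)

psms <- data.table(
  PSM = c("PEPTIDEA_2", "PEPTIDEB_2"),
  `Protein Accessions` = c("P12345; Q67890", "P12345"),
  Intensity = c(1000, 2000)
)

expanded <- psms[
  , .(ProteinName = trimws(unlist(strsplit(`Protein Accessions`, ";")))),
  by = .(PSM, Intensity)
]
# One row per peptide-protein match: PEPTIDEA_2 now appears twice.
```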

That being said, I will check what is going wrong in this case.


Kind regards,
Mateusz

Mateusz Staniak

Mar 10, 2025, 6:29:39 AM
to MSstats
Hi,


please find attached a script with example processing of your data.
I don't know if this is true of the entire dataset, but in this subset, groups of isoforms are all identified by the same shared peptide, which is not enough for a meaningful weighted summarization. Hence, the last object and plot in my script only show unique peptides. I will be happy to further assist you with the weighted summary.


Kind regards,
Mateusz

Sam Siljee

Mar 10, 2025, 3:31:42 PM
to MSstats
Hi Mateusz,

Thank you for helping out! In theory, my script should cover all of the points from your first message apart from the filtering. Currently I'm keeping all peptides, but I may revisit this later.
The column names should all be correct, and adding the clusters and normalising seems to work as expected. Should I then be using MSstatsConvert to finish preparing the data for getWeightedProteinSummary?
Unfortunately, the script doesn't seem to have attached to your last message!

Kind regards,
Sam

Mateusz Staniak

Mar 11, 2025, 8:19:27 PM
to MSstats
Hi,


sorry, I hope this time the file is there and it helps.


Kind regards,
Mateusz Staniak
pd_test_shared.R

Sam Siljee

Mar 13, 2025, 3:49:48 PM
to MSstats
Hi Mateusz,

Thank you for sending the script through; it works nicely with my test data! I think I was almost there based on your reproduction scripts, I was just getting tied up in the extra processing steps.
Unfortunately, when I try to process my whole dataset, R crashes with a fatal error (no error message, unfortunately). I've added some steps to clear memory and remove variables that are no longer needed, which will hopefully help it run successfully. I had also enclosed all of the code in a function, so I couldn't tell at which point it was crashing. I'll let you know how I get on.

Best,
Sam

Mateusz Staniak

Mar 17, 2025, 11:27:44 AM
to MSstats
Hi,


does it crash while pre-processing the data or while fitting the weighted model?


Kind regards,
Mateusz Staniak

Sam Siljee

Mar 31, 2025, 10:47:02 PM
to MSstats
Hi Mateusz,

I've been away on leave - sorry for not responding sooner. The crash occurs while fitting the weighted model; the pre-processing appears to run smoothly. There's no error, as the R session itself is aborted with the message "R encountered a fatal error. The session was terminated."
I've attached a script of my current code.

Sam
Final_TMT_MSstats_analysis.Rmd

Mateusz Staniak

Apr 1, 2025, 11:06:34 AM
to MSstats
Hi,


this could be related to the size of some of the protein clusters. After pre-processing the data, please try summarizing each cluster separately in a loop. Before that, it would be useful to check the largest numbers of proteins per cluster, and whether there are any peptides that match to, say, tens to hundreds of proteins. That can happen, and it is likely too much for the optimization routine.
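The per-cluster loop might look like the sketch below. `summarize_cluster` is a hypothetical stand-in for the real `getWeightedProteinSummary` call so the pattern is runnable on its own, and the column names are assumptions based on this workflow:

```r
# Summarize one cluster at a time, so a single problematic cluster is
# isolated instead of stopping the whole run.
library(data.table)

# Hypothetical stand-in for the real getWeightedProteinSummary() call:
summarize_cluster <- function(dt) {
  if (nrow(dt) == 0) stop("empty cluster")
  dt[, .(mean_log2intensity = mean(log2Intensity)), by = ProteinName]
}

input <- data.table(
  Cluster = c(1L, 1L, 2L, 2L),
  ProteinName = c("P1", "P1", "P2", "P3"),
  log2Intensity = c(10, 12, 20, 30)
)

# Diagnostic first: how many proteins does each cluster contain?
proteins_per_cluster <- input[, .(n_proteins = uniqueN(ProteinName)), by = Cluster]

results <- list()
for (cl in unique(input$Cluster)) {
  results[[as.character(cl)]] <- tryCatch(
    summarize_cluster(input[Cluster == cl]),
    error = function(e) conditionMessage(e)  # record the failure, keep going
  )
}
```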


Kind regards,
Mateusz

Sam Siljee

Apr 1, 2025, 7:10:11 PM
to MSstats
Hi Mateusz,

Thanks for the suggestions!
The range of the "NumProteinsPerPeptide" column is 1 to 22; however, after filtering the input to a "NumProteinsPerPeptide" of 1, it still crashes.
I think you're right about the size of the clusters being the problem: the number of observations per cluster ranges from 18 to 464,094 (I'm using TMTpro 18-plex).
There still seems to be something else going on, however, as a cluster of 720 observations causes R to crash while other, larger clusters seem to process just fine.
I can't share my original full dataset, but I'll make a dataset with randomised protein names and upload that.

Sam

Rplot.png

Sam Siljee

Apr 1, 2025, 8:39:07 PM
to MSstats
I don't think cluster size is the issue; I've finished some more testing, and the cluster with 464,094 observations still processed just fine.
I've not yet thought of a way to systematically search through all of the data to identify which clusters are not processing, especially as R crashes rather than stopping with an error.
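Since a fatal crash can't be caught with tryCatch, one workaround is to write each cluster id to a log file before summarizing it; after a crash, the last line of the log names the offending cluster. This is a minimal sketch: the cluster ids are placeholders and the commented-out summarization call stands in for the real one.

```r
# Log each cluster id to disk *before* summarizing it, so that after a
# fatal crash the last line of the log identifies the failing cluster.
log_file <- tempfile(fileext = ".txt")  # use a fixed path in practice

cluster_ids <- c("cluster_A", "cluster_B")  # placeholder ids
for (cl in cluster_ids) {
  cat(cl, "\n", file = log_file, append = TRUE, sep = "")
  # getWeightedProteinSummary(input[Cluster == cl], ...)  # real call here
}

readLines(log_file)  # after a crash, the final entry is the failing cluster
```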

Sam

Sam Siljee

Apr 1, 2025, 9:49:53 PM
to MSstats
With some trial and error, I've made some example pre-processed input datasets.
There are three clusters in each: one seems to run fine, one causes the loop to stop with an error (when looping through individual clusters), and one causes R to crash.
Let me know if you'd like a larger set of examples.


Sam

Mateusz Staniak

Apr 7, 2025, 10:37:01 AM
to MSstats
Hi,


apologies for my delayed response, I will check this as soon as I can. Thank you for providing the example data.


Kind regards,
Mateusz

Mateusz Staniak

Apr 16, 2025, 5:02:42 AM
to MSstats
Hi,


I'm pretty sure it's because of proteins with identical sets of features in a run. Have you tried the processIsoforms function? It has options to filter or merge such cases; it can also be done manually.


Kind regards,
Mateusz

Sam Siljee

Apr 23, 2025, 12:16:52 AM
to MSstats
Hi Mateusz,

I'm currently running the processIsoforms function as follows:

# Optional steps:
# merging or filtering isoforms if there is not enough feature-level information
procd_iso <- processIsoforms(procd, T, T, F)

I'm not able to get a help file for this function; do you have a suggestion for which of the arguments I should change? In the meantime, I'm repeating the Proteome Discoverer search using a database that doesn't include isoforms, to try that as well.

Sam

Sam Siljee

Apr 27, 2025, 11:05:26 PM
to MSstats
Hi Mateusz,
I've had a look at the duplicates in the data causing R to crash. When I select ProteinName, PSM, TechRepMixture, BioReplicate, Condition, Intensity, log2Intensity, and log2IntensityNormalized, and then look at which rows are identical, I notice a pattern. All of the duplicated rows have NA values for intensity and derived values, and they are all duplicates of TMT channels 126 and 135N. These are the channels I used for normalisation, which might help point to where the crash is coming from.
When I filter channels 126 and 135N out of the example dataset that causes R to crash, the getWeightedProteinSummary function doesn't run (it errors as with the dataset that stops with an error), but R no longer crashes.
When I filter out the proteins containing duplicates, the getWeightedProteinSummary function causes R to crash as before. Still no closer unfortunately; I'll keep digging.
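The duplicate check could be sketched as below; the columns mirror a subset of those listed above, and the toy rows reproduce the NA pattern in the normalisation channels for illustration only:

```r
# Find rows that are exact duplicates across the selected columns, and
# keep a single copy of each; column names follow the thread.
library(data.table)

dt <- data.table(
  ProteinName = c("P1", "P1", "P2"),
  PSM = c("PEPTIDEA_2", "PEPTIDEA_2", "PEPTIDEB_2"),
  Channel = c("126", "126", "127N"),
  Intensity = c(NA_real_, NA_real_, 500)
)

# All rows involved in a duplication (both the first copy and repeats):
dup_rows <- dt[duplicated(dt) | duplicated(dt, fromLast = TRUE)]

# Drop the exact duplicates before summarization:
deduped <- unique(dt)
```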

Sam

Mateusz Staniak

Apr 28, 2025, 6:03:34 AM
to MSstats
Hi,


thanks for the update, I'll look into this issue with channels as well. Your processIsoforms call seems OK. The last argument set to TRUE would remove subset proteins, but I don't think that's the issue here.


Kind regards,
Mateusz

Sam Siljee

Apr 29, 2025, 10:57:57 PM
to MSstats
I've also tried going back to using MSstatsTMT::PDtoMSstatsTMTFormat to format the PSM data before doing the normalisation and adding the clusters. I think this is likely the most straightforward way of merging my 8 fractions (in case the different fractions are what's causing the problem). Unfortunately this hasn't fixed the problem, and I now get a new error when running `processIsoforms`: "Error in graph_decomposed[[cluster_id]] : subscript out of bounds".
I've tried skipping the `processIsoforms` step, but this doesn't fix it. I've also repeated the PD search using a database without isoforms as a workaround, but this doesn't seem to help either.

Sam

Mateusz Staniak

May 5, 2025, 3:00:54 AM
to MSstats
Hi,

using PDtoMSstatsTMTFormat should be OK, as long as it doesn't remove the shared peptides.
I was able to get a summary on the "Input_error" sample data set after removing the duplicates. Where did they come from in the processing?
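One way to confirm that shared peptides survived conversion is to count distinct proteins per peptide in the converter output. This is a sketch under the assumption that the output uses the standard MSstatsTMT column names; the toy table is illustrative:

```r
# Check that the converter kept shared peptides: count distinct
# proteins per peptide and look for counts above one.
library(data.table)

converted <- data.table(
  ProteinName = c("P1", "P2", "P1"),
  PeptideSequence = c("PEPTIDEA", "PEPTIDEA", "PEPTIDEB"),
  Intensity = c(100, 100, 200)
)

shared_peptides <- converted[
  , .(n_proteins = uniqueN(ProteinName)),
  by = PeptideSequence
][n_proteins > 1]
# Zero rows here would mean the shared peptides were removed.
```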


Kind regards,
Mateusz

Mateusz Staniak

May 5, 2025, 3:17:21 AM
to MSstats
Hi,


quick update, my mistake: it wasn't just duplicates, I also updated the way I re-merge information on peptide-protein matches to quant data during summarization. Please install version 0.99.7 from the devel branch of the GitHub repository and see if it fixes your issue.


Kind regards,
Mateusz

Sam Siljee

May 5, 2025, 4:17:54 AM
to MSstats
Hi Mateusz,

Thanks for your help!
I won't be able to run anything until tomorrow, unfortunately, but I thought I'd clarify something.
I checked my session info, and apparently I already have version 0.99.7 of MSstatsWeightedSummary installed (I installed it quite a while back, so I doubt it has any recent changes). Also, on GitHub it appears that the master branch was updated an hour ago, while the devel branch was updated 2 months ago. Should I install from the master branch?

Sam

Sam Siljee

May 5, 2025, 9:15:07 PM
to MSstats
I think the duplicates come from splitting the protein groups.
The output of PD comes in wide format, with protein groups given as UniProt IDs separated by semicolons. I split the rows of protein groups, duplicating the channel intensities and other information into new rows, with each protein in the group as its own row. Hopefully that makes sense!

Sam

Mateusz Staniak

May 19, 2025, 4:52:22 AM
to MSstats
Hi,

please use the current devel branch. The version is 0.99.15.
Regarding the duplicates, I think I also encountered them. Running unique() on converter output should fix the issue.

Kind regards,
Mateusz Staniak