Poor processing speed with 515 LC-MS/MS files and over 2,000 proteins during the dataProcess step

PARTHIBAN PERIASAMY

Sep 8, 2025, 4:54:45 AM
to MSstats

I’m analyzing 515 LC-MS/MS files (FragPipe → MSstats). Below is the current workflow for dataProcess, diagnostic plots, and protein–sample matrix generation:

# ---- Step 1: Normalize & summarize ----
processedData <- dataProcess(
  raw = formattedData,
  logTrans = 2,
  summaryMethod = "TMP",              # faster than "linear"
  normalization = "equalizeMedians",
  MBimpute = impute,                  # TRUE/FALSE
  remove50missing = FALSE,
  censoredInt = "NA"
)

# ---- Step 2: Generate plots ----
plot_types <- c("QCPlot", "ProfilePlot", "ConditionPlot")
for (ptype in plot_types) {
  dataProcessPlots(
    processedData,
    type = ptype,
    text.angle = ifelse(ptype == "QCPlot", 45, 0),
    width = 120,
    height = 60,
    address = file.path(output_dir, paste0("Report_", ptype, "_", suffix))
  )
}

# ---- Step 3: Protein × Sample matrix ----
protein_df <- build_protein_matrix(processedData$ProteinLevelData)
matrix_file <- file.path(output_dir, paste0("Protein_Matrix_", suffix, ".csv"))
readr::write_csv(protein_df, matrix_file, na = "")
cat("✔ Completed:", suffix, " (plots + protein matrix)\n")

Unfortunately, processing is slow and frequently stalls at:

INFO [2025-09-08 12:53:36] == Start the summarization per subplot...
|================================ | 20%

Could you advise on a more efficient way to code this to improve throughput?

Regards,
Ben

Devon Kohler

Sep 8, 2025, 9:50:06 AM
to MSstats
Hi Ben,

The best way to speed up the code is to use top-N feature filtering (especially for DIA). In dataProcess, I would recommend setting `featureSubset = "topN"` and `n_top_feature = 30` (you can go higher if you'd like, but generally I wouldn't recommend setting this higher than 100).

Another easy way to speed up the code is to use parallel processing. To do this, set `numberOfCores` in dataProcess to however many cores you would like to use. Note that to see the progress when using parallel processing, you will need to look at the log file.
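As a rough sketch, the two options above could be combined in a single call like the one below. It assumes `formattedData` is the MSstats-format table from the FragPipe converter earlier in this thread, and that the installed MSstats version is recent enough to accept `numberOfCores`. Leaving one core free and setting `MBimpute = TRUE` are illustrative choices, not recommendations from the package.

```r
# Assumption: use all but one core; falls back to 1 if the count is unknown.
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

# Guarded so the sketch is harmless when run outside the real analysis session.
if (exists("formattedData") && requireNamespace("MSstats", quietly = TRUE)) {
  processedData <- MSstats::dataProcess(
    raw = formattedData,
    logTrans = 2,
    normalization = "equalizeMedians",
    featureSubset = "topN",      # keep only the top features per protein
    n_top_feature = 30,          # value suggested above; stay at or below ~100
    summaryMethod = "TMP",
    MBimpute = TRUE,             # illustrative; the original script used `impute`
    censoredInt = "NA",
    remove50missing = FALSE,
    numberOfCores = n_cores      # progress is then written to the log file
  )
}
```

On a shared machine it may be safer to hard-code a smaller `numberOfCores`, since each worker needs its own share of memory.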

Examples of these options are outlined in the protocol here: https://www.nature.com/articles/s41596-024-01000-3

Let me know if you are still running into problems with these options.

Best,
Devon

PARTHIBAN PERIASAMY

Sep 8, 2025, 10:46:29 AM
to MSstats

Hi Devon,

Thank you for your quick response. Unfortunately, I’m encountering issues with MSstatsBig when using the Arrow backend. Below is the relevant code snippet and error message:

# ===========================
# MSstatsBig
# ===========================
formattedData <- bigFragPipetoMSstatsFormat(
  input_file = raw_file,
  output_file_name = NULL,  # ← no CSV writing
  backend = "arrow",
  max_feature_count = TOP_N,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE
) %>%
  dplyr::collect()  # materialize to a data.frame for MSstats

+++++++++++++++++++++++++++++++++++++++++++++++++++++


Error:

Error in arrow::write_csv_arrow():
! x must be an object of class 'data.frame', 'RecordBatch', 'Dataset', 'Table', or 'RecordBatchReader' not 'arrow_dplyr_query'.
Run rlang::last_trace() to see where the error occurred.

Could you please advise on how to resolve this?


Regards,

Ben 

Mateusz Staniak

Sep 8, 2025, 11:12:43 AM
to MSstats
Hi,


Devon's advice concerns the dataProcess function from the MSstats package. The Arrow backend version has not been adjusted for interface changes in other packages, hence that error. I have a Spark version, but I won't be able to include it in MSstatsBig [GitHub version] until next month. However, both solutions only help re-format very large input data into a format digestible by the dataProcess function. The biggest bottleneck in dataProcess itself is missing value imputation, which can be terribly slow with a large number of features. Hence, the fastest way to pre-process data for analysis with MSstats at the moment is to restrict the number of features per protein and/or parallelize directly in MSstats::dataProcess.
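To illustrate what restricting the number of features per protein means, here is a minimal base-R sketch on toy data. It keeps the n features with the highest mean log2 intensity within each protein. The helper `top_n_features` and the simplified column `Feature` are hypothetical names for illustration only; this is not the selection code that MSstats or MSstatsBig run internally, and their ranking criteria may differ.

```r
set.seed(1)

# Toy long-format table: 2 proteins x 5 features x 4 runs (made-up values).
toy <- expand.grid(
  ProteinName = c("P1", "P2"),
  Feature = paste0("f", 1:5),
  Run = paste0("run", 1:4),
  stringsAsFactors = FALSE
)
toy$Intensity <- 2^rnorm(nrow(toy), mean = 20, sd = 2)

# Hypothetical helper: keep the n features per protein with the highest
# mean log2 intensity across runs.
top_n_features <- function(df, n) {
  df$log2Int <- log2(df$Intensity)
  avg <- aggregate(log2Int ~ ProteinName + Feature, data = df, FUN = mean)
  # Rank within each protein; rank 1 is the highest mean intensity.
  avg$rank <- ave(-avg$log2Int, avg$ProteinName,
                  FUN = function(x) rank(x, ties.method = "first"))
  keep <- avg[avg$rank <= n, c("ProteinName", "Feature")]
  merge(df[, names(df) != "log2Int"], keep, by = c("ProteinName", "Feature"))
}

filtered <- top_n_features(toy, n = 2)
```

Fewer features per protein means fewer values to impute and summarize, which is why this kind of restriction shortens the dataProcess step.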

Kind regards,
Mateusz