Hi,
The dataProcess function performs several steps, and at each one there is something you can do to improve the running time. I'll assume there's no fractionation in your data (if there is, only some details change):
- dataProcess requires that there is a row for each feature in each run. If your input data already satisfy this requirement, the potentially lengthy computations needed to complete the data will not be necessary,
- then, dataProcess performs normalization. If you can normalize data in an efficient way outside of dataProcess, it should speed it up significantly (just set normalization = "NONE"),
- selecting features for summarization via the "highQuality" method can take some time. If you want speed, choose "topN" with a low N (3 is the default N for this method; note that by default dataProcess uses all features),
- another operation is finding values that will be considered as censored. You can set maxQuantileforCensored = NULL to skip this (and perhaps do it outside of dataProcess function),
- with data in the right format (normalized, with a row for every run and feature, etc.), you can run dataProcess for each protein separately or for batches of proteins and parallelize your code.
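
To make the steps above more concrete, here is a rough sketch in base R (plus the parallel package that ships with R). It is only an illustration under some assumptions, not the code MSstats uses internally: the helper names (is_balanced, equalize_medians, split_by_protein, run_batches) are made up for this example, the Feature column is a hypothetical identifier you would build yourself from the peptide/charge/fragment columns, and the median shift is a simple stand-in for median equalization. Intensities are assumed to be on the raw scale.

```r
library(parallel)

# Hypothetical helper: TRUE if every feature present in the data has exactly
# one row in every run. Assumes columns named Feature and Run.
is_balanced <- function(df) {
  counts <- table(df$Feature, df$Run)
  all(counts == 1)
}

# Hypothetical helper: shift log2 intensities so that all runs share the same
# median, then return to the raw scale. A simple stand-in for median
# equalization, done on the whole dataset before splitting it.
equalize_medians <- function(df) {
  log_int <- log2(df$Intensity)
  run_medians <- tapply(log_int, df$Run, median, na.rm = TRUE)
  shift <- median(run_medians, na.rm = TRUE) - run_medians[as.character(df$Run)]
  df$Intensity <- 2^(log_int + as.numeric(shift))
  df
}

# Split the data into batches of proteins; each batch keeps all rows of its
# proteins, so batches can be processed independently.
split_by_protein <- function(df, batch_size = 100) {
  proteins <- unique(df$ProteinName)
  batch_id <- ceiling(seq_along(proteins) / batch_size)
  batches <- split(proteins, batch_id)
  lapply(batches, function(p) df[df$ProteinName %in% p, , drop = FALSE])
}

# Apply a per-batch function in parallel on a local cluster.
# process_batch would wrap dataProcess, for example (argument names other than
# normalization and maxQuantileforCensored are from memory, check ?dataProcess):
#   function(b) MSstats::dataProcess(b, normalization = "NONE",
#                                    featureSubset = "topN", n_top_feature = 3,
#                                    maxQuantileforCensored = NULL)
run_batches <- function(batches, process_batch, cores = 2) {
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))
  parLapply(cl, batches, process_batch)
}
```

You would first check is_balanced(your_data), normalize once on the full dataset, then call run_batches(split_by_protein(normalized_data), your_wrapper) and combine the per-batch results yourself. Normalizing before splitting matters, because run-level medians should be computed over all proteins, not within a batch.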
Btw, we will be releasing a new version of MSstats in October. The new version will be easier to work with for big datasets.
Kind regards,
Mateusz Staniak