big dataset


froehlic...@gmail.com

Sep 18, 2020, 8:00:23 AM
to MSstats
Hi everyone,
I am currently analyzing 92 DIA runs with ~6500 proteins identified per run.
The dataProcess() step is taking incredibly long: after 3 hours, progress is at 5% and memory usage is at 38 GB of RAM.
Could I split up the dataProcess() call somehow? I will let you know whether it runs through, but I will have to repeat this step several times, and it would help immensely if I could speed it up or run the analysis on a normal computer.
Best, Klemens

Mateusz Staniak

Sep 21, 2020, 6:59:24 PM
to MSstats
Hi,

the dataProcess function does several things, and at each step there is something you can do to improve the running time. I'll assume there's no fractionation in your data (if there is, only some details change):

- dataProcess requires that there is a row for each feature in each run. If your input data already satisfy this requirement, there will be no need for the potentially lengthy computations that fill in missing rows,

- then, dataProcess performs normalization. If you can normalize the data efficiently outside of dataProcess, it should speed things up significantly (just set normalization = "NONE"),

- selecting features for summarization via the "highQuality" method can take some time. If you want speed, choose "topN" with a low N (3 is the default for this method, but by default dataProcess uses all features),

- another operation is finding values that will be considered censored. You can set maxQuantileforCensored = NULL to skip this (and perhaps do it outside of the dataProcess function),

- with data in the right format (normalized, with a row for every run and feature, etc.), you can run dataProcess for each protein separately or for batches of proteins and parallelize your code.
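Editor's sketch of the last point, in base R. It assumes the input is already normalized and complete, since normalizing inside each batch would not be comparable across batches. The helper names `split_proteins` and `process_in_batches` are hypothetical and not part of MSstats, and the dataProcess arguments in the trailing comment are only the ones named in this thread.

```r
library(parallel)

# Hypothetical helper: split the unique protein names into batches.
split_proteins <- function(protein_names, batch_size) {
  proteins <- unique(as.character(protein_names))
  split(proteins, ceiling(seq_along(proteins) / batch_size))
}

# Hypothetical helper: apply `process_fun` to the rows of each protein batch.
# Forked workers (mclapply) are not available on Windows, so fall back to a
# sequential lapply there.
process_in_batches <- function(input, process_fun, batch_size = 500, cores = 1) {
  batches <- split_proteins(input$ProteinName, batch_size)
  run_batch <- function(proteins) {
    process_fun(input[input$ProteinName %in% proteins, , drop = FALSE])
  }
  if (cores > 1 && .Platform$OS.type != "windows") {
    mclapply(batches, run_batch, mc.cores = cores)
  } else {
    lapply(batches, run_batch)
  }
}

# Possible use with MSstats (not run here; `for_msstats_prot` is the
# 10-column input from this thread):
# results <- process_in_batches(
#   for_msstats_prot,
#   function(d) MSstats::dataProcess(d, normalization = "NONE"),
#   batch_size = 500
# )
```

Each list element then holds the result for one batch, so a failed batch can be rerun without repeating the whole analysis.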


Btw, we will be releasing a new version of MSstats in October. The new version will be easier to work with for big datasets.


Kind regards,
Mateusz Staniak

froehlic...@gmail.com

Sep 22, 2020, 5:09:57 AM
to MSstats
Hi Mateusz,

Thank you for your input! Two more quick questions:

I will try to adjust the parameters and perform the steps you described outside of dataProcess().

Is the new version already available via GitHub by any chance?

"dataProcess requires that there is a row for each feature in each run"
Could you please briefly comment on that? The 10-column input format described in the manuals suggests that every transition has its own row.
If I only have one feature per precursor, what should I fill the transition column with? NA or an artificial transition?

Best, Klemens

Mateusz Staniak

Sep 22, 2020, 5:21:41 AM
to MSstats
Hi,


dataProcess creates three columns from data in the standard 10-column format:
- peptide = PeptideSequence + PrecursorCharge
- transition = FragmentIon + ProductCharge
- feature = peptide + transition

If your features are the same as peptides, the FragmentIon and ProductCharge columns should each have a single value (it can be NA; that's how the converter functions XtoMSstatsFormat fill these columns in).
With features defined this way, dataProcess makes sure that each feature has a row for each run by filling in Intensity=NA where the row was missing.

The new version of the package is in development on GitHub at Vitek-Lab/MSstatsConvert and Vitek-Lab/MSstats-dev, but the changes are not properly documented yet, and it's still under heavy testing, unfortunately.
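Editor's sketch of the feature definition and row completion described above, in base R. It is a simplified illustration, not the code MSstats runs: the "_" separator and the helper name `complete_features` are assumptions, and only the FEATURE, Run and Intensity columns are kept, whereas a real input also needs Condition, BioReplicate and so on carried along.

```r
# Hypothetical helper: build the feature key and add an Intensity = NA row
# for every feature/run combination that is missing from the input.
complete_features <- function(input) {
  # peptide = PeptideSequence + PrecursorCharge
  input$PEPTIDE <- paste(input$PeptideSequence, input$PrecursorCharge, sep = "_")
  # transition = FragmentIon + ProductCharge (NA values become the text "NA")
  input$TRANSITION <- paste(input$FragmentIon, input$ProductCharge, sep = "_")
  # feature = peptide + transition
  input$FEATURE <- paste(input$PEPTIDE, input$TRANSITION, sep = "_")

  # Every feature crossed with every run
  grid <- expand.grid(FEATURE = unique(input$FEATURE),
                      Run = unique(input$Run),
                      stringsAsFactors = FALSE)

  # Left join: combinations absent from the input get Intensity = NA
  merge(grid, input[, c("FEATURE", "Run", "Intensity")],
        by = c("FEATURE", "Run"), all.x = TRUE)
}

# Toy precursor-level data: the second peptide was not measured in run "r2"
toy <- data.frame(PeptideSequence = c("AAK", "AAK", "CCR"),
                  PrecursorCharge = c(2, 2, 3),
                  FragmentIon = NA,
                  ProductCharge = NA,
                  Run = c("r1", "r2", "r1"),
                  Intensity = c(10, 20, 30))
completed <- complete_features(toy)
```

Comparing `nrow(completed)` with the number of rows in the input shows how many rows would have to be filled in.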

Kind regards,
Mateusz

froehlic...@gmail.com

Oct 27, 2020, 7:56:12 AM
to MSstats
Dear Mateusz,
Thank you for the answer!

I can get the desired format, and dataProcess() also seems to work nicely. However, when I perform a statistical test with a contrast matrix, I only get NAs for logFCs and p-values.
Could you please take a quick look at what I have done wrong?
I have tried to reproduce the error with a small subset of peptides. I suspect I did something wrong with the annotation or the contrast matrix, but I cannot find the error.

Please find attached the workspace containing my 10-column format input and the log of the MSstats comparison (it only says there was an error at the protein comparison, but not what kind).

Here is the code I use to generate the contrast matrix and perform the statistical test:

# Protein summarization
quant_DIANN_prot <- MSstats::dataProcess(raw = subset_msstats)

# Contrast matrix: one row per comparison, one column per condition
comparison <- data.frame(V1 = c(1, -1),
                         V2 = c(-1, 0),
                         V3 = c(0, 1),
                         V4 = c(0, 0))
colnames(comparison) <- levels(quant_DIANN_prot$ProcessedData$GROUP_ORIGINAL)
rownames(comparison) <- c("LE 1-12 vs 1-25", "LE 1-6 vs 1-12")

# Group comparison using the contrast matrix
compare_prot <- MSstats::groupComparison(contrast.matrix = as.matrix(comparison), data = quant_DIANN_prot)
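Editor's sketch: a contrast matrix like the one above can also be built by condition name, so that the +1/-1 entries cannot end up in the wrong column if the order of the factor levels changes. The helper `build_contrast_matrix` and the condition names "A" to "D" are hypothetical; in the thread, the real names come from levels(quant_DIANN_prot$ProcessedData$GROUP_ORIGINAL).

```r
# Hypothetical helper: `comparisons` is a named list of c(numerator, denominator)
# pairs. The result has one row per comparison and one column per condition.
build_contrast_matrix <- function(groups, comparisons) {
  m <- matrix(0, nrow = length(comparisons), ncol = length(groups),
              dimnames = list(names(comparisons), groups))
  for (label in names(comparisons)) {
    pair <- comparisons[[label]]
    # Fail early if a comparison names a condition that does not exist
    stopifnot(length(pair) == 2, all(pair %in% groups))
    m[label, pair[1]] <- 1
    m[label, pair[2]] <- -1
  }
  m
}

groups <- c("A", "B", "C", "D")  # placeholder condition names
contrasts <- build_contrast_matrix(
  groups,
  list("A vs B" = c("A", "B"), "C vs A" = c("C", "A"))
)
```

With these placeholder names the result has the same entries as the matrix in the post above, and every row sums to zero.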

It would be great if you have any ideas, because I have been stuck at this stage for 2 weeks now and nothing I have tried worked so far.

Best, Klemens

msstats-25.log
reproducible_error.RData

Mateusz Staniak

Oct 27, 2020, 10:05:11 AM
to MSstats
Dear Klemens,



thanks for providing the data - I can reproduce the problem. I will check what is wrong and get back to you.



Best,
Mateusz

froehlic...@gmail.com

Nov 6, 2020, 8:16:31 AM
to MSstats
Hi Mateusz,
Any updates on the matter? I tried a few different things in the meantime with the annotation but so far nothing seems to work.
I really appreciate your help in the matter!
Best, Klemens

Mateusz Staniak

Nov 10, 2020, 6:54:53 AM
to MSstats
Hi,


sorry for taking so long, I finally found the problem. It is caused by Subject 13 (BioReplicate = 13), which belongs to two different conditions. If you remove this BioReplicate before running dataProcess, you'll get valid results.
Let me know if you have more questions.
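Editor's sketch of a quick check for this situation before running dataProcess. It assumes a group-comparison design in which each BioReplicate is expected in exactly one Condition; in a paired or time-course design, repeated IDs across conditions can be intentional. The helper name `find_ambiguous_bioreplicates` and the toy annotation are hypothetical.

```r
# Hypothetical helper: return the BioReplicate IDs that occur in more than
# one Condition.
find_ambiguous_bioreplicates <- function(annotation) {
  pairs <- unique(annotation[, c("BioReplicate", "Condition")])
  counts <- table(pairs$BioReplicate)
  names(counts)[counts > 1]
}

# Toy annotation in which replicate 13 was assigned to two conditions
annotation <- data.frame(Run = paste0("run", 1:5),
                         BioReplicate = c(12, 13, 13, 14, 13),
                         Condition = c("A", "A", "B", "B", "A"))
ambiguous <- find_ambiguous_bioreplicates(annotation)
```

An empty result means that every BioReplicate maps to a single Condition.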


Best,
Mateusz Staniak

froehlic...@gmail.com

Nov 11, 2020, 1:15:07 PM
to MSstats
Hi Mateusz,
Thank you so much! It works now! I based the conditions and the bioreplicates on the file names and 13 was in there twice. I am so sorry for taking up your time!
Best, Klemens

froehlic...@gmail.com

Nov 13, 2020, 6:35:56 AM
to MSstats
Hi Mateusz,
The whole analysis pipeline is running through now and gives me excellent results! Thank you so much for the support!

However, I noticed something really weird:
When I ran the MSstats::dataProcess() function, it took around 3 hours. That is not a problem if I only have one analysis and don't want to iterate over different settings of the function. But afterwards R was really slow: evaluating 2+2 took 20 seconds.
I saved the workspace, closed R, reloaded the workspace, and R was responding normally again.
I then ran MSstats::groupComparison(), which finished in ~1 hour.
I saved the workspace again in order to continue working from here in the future.
This took several hours (I let it run overnight), and the next morning I discovered that the workspace is now about 375 GB.

If you want, I could try to replicate this on another machine. Do you need any additional information about my session or hardware?
Has this happened in the past, or is this "normal"?

Best, Klemens

Mateusz Staniak

Nov 13, 2020, 7:09:18 AM
to MSstats
Hi,


glad to hear the pipeline is running,

regarding the memory: I'm not sure there's anything that can be done about it... It might be a general problem with R memory management (I do experience this from time to time even with smaller datasets).  It could depend on a combination of: dataset size, operating system, other processes running on that machine. How big is your data in GB? What's your OS? How big is the output from dataProcess? You can check it with the pryr::object_size function. What about the groupComparison output? Let me know and I'll see what I can do about it.
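Editor's sketch: the size of every object in a workspace can be listed at once with base R. This uses utils::object.size instead of pryr::object_size; the two can report different numbers, for example when objects share data. The helper name `object_sizes` is hypothetical.

```r
# Hypothetical helper: size in bytes of every object in an environment,
# largest first.
object_sizes <- function(env = globalenv()) {
  sizes <- vapply(ls(env),
                  function(nm) as.numeric(object.size(get(nm, envir = env))),
                  numeric(1))
  sort(sizes, decreasing = TRUE)
}

# Example on a small throwaway environment
demo_env <- new.env()
assign("small", 1:10, envir = demo_env)
assign("large", numeric(1e4), envir = demo_env)
sizes <- object_sizes(demo_env)
round(sizes / 1024^2, 3)  # sizes in MB
```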


Best,
Mateusz

froehlic...@gmail.com

Nov 14, 2020, 9:33:46 AM
to MSstats
Hi Mateusz,

I'll start from the top.
Here is my complete R environment:

> pryr::object_size(compare_prot)
1.89 GB
> pryr::object_size(for_msstats_prot)
724 MB
> pryr::object_size(main_output2)
2.59 GB
> pryr::object_size(quant_DIANN_prot)
1.07 GB
> pryr::object_size(comparison)
1.3 kB

main_output2 is the output of the DIA software
for_msstats_prot is the 10-column format with the necessary BioReplicate etc. info
quant_DIANN_prot is the output of MSstats::dataProcess(for_msstats_prot)
compare_prot is the output of MSstats::groupComparison(contrast.matrix = as.matrix(comparison), data = quant_DIANN_prot)
comparison is the contrast matrix for groupComparison()


OS: Windows 10
16 GB RAM
i7-9750HF (6 cores, 12 threads @ 4.2 GHz boost)
Some background processes run at the same time: email, browser, Spotify.
I will now try to replicate the problem on another machine with the latest RStudio, R, and MSstats.

I saved the workspace after dataProcess() (workspaceModifiedSequence1, 800 MB) and after groupComparison() (workspaceModifiedSequence2, 375 GB).

I would say my environment should only be about ~5 GB?

I could also share the dataset and the R script if you want.

Thanks for your time!
Klemens

Mateusz Staniak

Nov 17, 2020, 5:03:33 AM
to MSstats
Hi,


this is indeed strange,
please send me the R script, too, and I'll see if I can find the problem.



Kind regards,
Mateusz

froehlic...@gmail.com

Nov 19, 2020, 8:43:17 AM
to MSstats
Hi Mateusz,
I cannot attach R scripts here, so I sent you the R script directly.
I was also able to reproduce the problem on another machine running Windows Server with 256 GB RAM.
The workspace is again 375 GB after groupComparison().

Best, Klemens

Mateusz Staniak

Apr 8, 2021, 6:51:58 AM
to MSstats
Hi,


the new version will have an option not to save fitted models. Currently, every model fitted for groupComparison is saved in the output object, and these models drive its size in memory.
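Editor's sketch of a possible workaround until that option is available: drop the fitted models from the groupComparison output and save only the remaining parts. It assumes the models sit in a list element called `fittedmodel`; that name is an assumption and should be checked with names(compare_prot). The helper `drop_fitted_models` and the toy list are hypothetical.

```r
# Hypothetical helper: remove the element that holds the fitted models
# (matched case-insensitively) and keep everything else.
drop_fitted_models <- function(comparison_output) {
  is_model_slot <- tolower(names(comparison_output)) == "fittedmodel"
  comparison_output[!is_model_slot]
}

# Toy stand-in for a groupComparison result
toy_output <- list(ComparisonResult = data.frame(Protein = "P1", log2FC = 0.5),
                   ModelQC = data.frame(Protein = "P1", residuals = 0.1),
                   fittedmodel = list("model 1", "model 2"))
slim <- drop_fitted_models(toy_output)

# Saving one object with saveRDS avoids writing the whole workspace:
# saveRDS(slim, "compare_prot_slim.rds")
```

The comparison table stays available in `slim$ComparisonResult`, but the per-protein models can no longer be inspected afterwards.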



Kind regards,
Mateusz