MSstats error

komw...@gmail.com

May 15, 2023, 12:43:46 PM
to MSstats
Hello MSstats team,

I have a quick question about the MSstats normalization process.
I tried to equalize medians before running the group comparison function in MSstats, and it gave me this error:

> srm.equalmed <- dataProcess(SRMRawData, normalization = 'equalizeMedians')
INFO [2023-05-16 01:04:03] ** Features with one or two measurements across runs are removed.
INFO [2023-05-16 01:04:03] ** Fractionation handled.
INFO [2023-05-16 01:04:03] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2023-05-16 01:04:03] ** Log2 intensities under cutoff = 5.4378 were considered as censored missing values.
INFO [2023-05-16 01:04:03] ** Log2 intensities = NA were considered as censored missing values.
INFO [2023-05-16 01:04:03] ** Use all features that the dataset originally has.
INFO [2023-05-16 01:04:03]
 # proteins: 65
 # peptides per protein: 1-1
 # features per peptide: 3-5
INFO [2023-05-16 01:04:03]
                    Disease Healthy
             # runs      51      36
    # bioreplicates      51      36
 # tech. replicates       1       1
INFO [2023-05-16 01:04:03] Some features are completely missing in at least one condition:
 EHVAHLLFLR_3_y3_1,
 NA ...
INFO [2023-05-16 01:04:03] == Start the summarization per subplot...
  |====================================                                                  |  28%Aggregate function missing, defaulting to 'length'
<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
INFO [2023-05-16 01:04:03] == Summarization is done.
Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], :
  Elements listed in `by` must be valid column names in x and y
In addition: Warning messages:
1: In survreg.fit(X, Y, weights, offset, init = init, controlvals = control, :
  Ran out of iterations and did not converge
2: In merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], :
  You are trying to join data.tables where 'y' argument is 0 columns data.table.

How can I solve it?
I have also attached my Skyline output data:
MSstatsInput5.csv

Mateusz Staniak

May 15, 2023, 12:56:23 PM
to MSstats
Hi,

I wasn't able to reproduce the issue. Did you apply the converter function before running dataProcess?

My code:
sli = readr::read_csv("MSstatsInput5.csv")
sli_conv = MSstats::SkylinetoMSstatsFormat(sli)
unique(sli_conv$IsotopeLabelType)

sli_dp = MSstats::dataProcess(sli_conv) # no error
sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1

locale:
 [1] LC_CTYPE=pl_PL.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=pl_PL.UTF-8 LC_MONETARY=pl_PL.UTF-8
 [6] LC_MESSAGES=pl_PL.UTF-8 LC_PAPER=pl_PL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
 [1] MSstats_4.6.5 gtools_3.9.4 tidyselect_1.2.0 splines_4.2.1 lattice_0.20-45 log4r_0.4.3
 [7] colorspace_2.1-0 vctrs_0.6.1 generics_0.1.3 utf8_1.2.3 survival_3.4-0 marray_1.76.0
[13] rlang_1.1.0 pillar_1.9.0 nloptr_2.0.3 glue_1.6.2 withr_2.5.0 bit64_4.0.5
[19] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.3 caTools_1.18.2 tzdb_0.3.0 parallel_4.2.1
[25] fansi_1.0.4 preprocessCore_1.60.2 Rcpp_1.0.10 KernSmooth_2.23-20 readr_2.1.3 scales_1.2.1
[31] backports_1.4.1 checkmate_2.1.0 limma_3.54.1 vroom_1.6.1 bit_4.0.5 lme4_1.1-31
[37] gplots_3.1.3 ggplot2_3.4.2 hms_1.1.2 stringi_1.7.12 dplyr_1.1.1 ggrepel_0.9.2
[43] grid_4.2.1 MSstatsConvert_1.9.3 cli_3.6.1 tools_4.2.1 bitops_1.0-7 magrittr_2.0.3
[49] tibble_3.2.1 crayon_1.5.2 pkgconfig_2.0.3 ellipsis_0.3.2 MASS_7.3-58.1 Matrix_1.5-4
[55] data.table_1.14.8 minqa_1.2.5 rstudioapi_0.14 R6_2.5.1 boot_1.3-28 nlme_3.1-159
[61] compiler_4.2.1


Kind regards,
Mateusz

Xinle Tan

Feb 13, 2024, 8:04:58 AM
to MSstats
Hi Mateusz,

I ran into the same error as the original poster with my own dataset. I then used the original poster's input file and ran into exactly the same error.

My code:
library(MSstats) #imports msstats
ms <- read.csv('MSstatsInput5.csv', sep=",")
head(ms) #shows column names
QuantData <- MSstats::dataProcess(ms) #processes the data

> ms <- read.csv('MSstatsInput5.csv', sep=",")
> head(ms) #shows column names
            ProteinName  PeptideSequence PeptideModifiedSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Run Intensity Condition
1  sp|P00748|FA12_HUMAN         TEQAAVAR                TEQAAVAR               2          b3             1            heavy   4     37315   Healthy
2 sp|Q06033|ITIH3_HUMAN       DYIFGNYIER              DYIFGNYIER               2          b3             1            heavy   4     43777   Healthy
3  sp|P02743|SAMP_HUMAN           VFVFPR                  VFVFPR               2          b3             1            heavy   4     52653   Healthy
4   sp|P00740|FA9_HUMAN      NCELDVTCNIK   NC[+57]ELDVTC[+57]NIK               2          b3             1            heavy   4     58457   Healthy
5  sp|P00488|F13A_HUMAN GTYIPVPIVSELQSGK        GTYIPVPIVSELQSGK               2          b3             1            heavy   4     58919   Healthy
6  sp|P22891|PROZ_HUMAN         GLLSGWAR                GLLSGWAR               2          b3             1            heavy   4     79872   Healthy
  BioReplicate
1            4
2            4
3            4
4            4
5            4
6            4
> QuantData <- MSstats::dataProcess(ms) #processes the data
INFO  [2024-02-13 23:02:56] ** Features with one or two measurements across runs are removed.
INFO  [2024-02-13 23:02:56] ** Fractionation handled.
INFO  [2024-02-13 23:02:56] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO  [2024-02-13 23:02:57] ** Log2 intensities under cutoff = 5.4378  were considered as censored missing values.
INFO  [2024-02-13 23:02:57] ** Log2 intensities = NA were considered as censored missing values.
INFO  [2024-02-13 23:02:57] ** Use all features that the dataset originally has.
INFO  [2024-02-13 23:02:57]
 # proteins: 65
 # peptides per protein: 1-1
 # features per peptide: 3-5
INFO  [2024-02-13 23:02:57]
                    Disease Healthy
             # runs      51      36
    # bioreplicates      51      36
 # tech. replicates       1       1
INFO  [2024-02-13 23:02:57] Some features are completely missing in at least one condition:  
 EHVAHLLFLR_3_y3_1,
 NA ...
INFO  [2024-02-13 23:02:57]  == Start the summarization per subplot...
  |========================================                                                                                                       |  28%Aggregate function missing, defaulting to 'length'

<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
INFO  [2024-02-13 23:02:57]  == Summarization is done.
Error: Elements listed in `by` must be valid column names in x and y

In addition: Warning messages:
1: In survreg.fit(X, Y, weights, offset, init = init, controlvals = control,  :
  Ran out of iterations and did not converge
2: Input data.table 'y' has no columns. 

Below is my R version:
> R.version
               _                          
platform       x86_64-apple-darwin20      
arch           x86_64                      
os             darwin20                    
system         x86_64, darwin20            
status                                    
major          4                          
minor          3.2                        
year           2023                        
month          10                          
day            31                          
svn rev        85441                      
language       R                          
version.string R version 4.3.2 (2023-10-31)
nickname       Eye Holes  

Has anyone solved this problem?

Mateusz Staniak

Feb 13, 2024, 8:16:15 AM
to MSstats
Hi,


Can you share a small subset of the data that reproduces this issue? The data can be anonymized by changing the protein, run, and condition labels. As long as the error is reproduced, one protein might be enough.


Kind regards,
Mateusz

Xinle Tan

Feb 13, 2024, 9:12:32 AM
to MSstats
Hi Mateusz,

I tried subsetting only a few proteins to represent the dataset, and the error is not there anymore!

Apparently something is going on with a few proteins. How would I know which ones are wrong? I have 5000 proteins and the dataset is 260 MB.
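
One way to locate the offending proteins without inspecting the whole file is to run the processing step on each protein separately and record which ones raise an error. The following is a minimal sketch in base R. `find_failing_proteins`, its arguments, and the toy stand-in `fake_process` are hypothetical names, not part of MSstats, and the sketch assumes the long-format table has a `ProteinName` column as in the output above.

```r
# Hypothetical helper (not part of MSstats): run `process_fun` on each
# protein's rows separately and collect the error message, if any.
find_failing_proteins <- function(data, process_fun,
                                  protein_col = "ProteinName") {
  proteins <- unique(as.character(data[[protein_col]]))
  errors <- vapply(proteins, function(p) {
    rows <- which(data[[protein_col]] == p)
    subset_p <- data[rows, , drop = FALSE]
    tryCatch({
      process_fun(subset_p)
      NA_character_                     # no error for this protein
    }, error = function(e) conditionMessage(e))
  }, character(1))
  errors[!is.na(errors)]                # named vector: protein -> message
}

# Toy demonstration with a stand-in for the real processing function.
toy <- data.frame(ProteinName = c("P1", "P1", "P2", "P3"),
                  Intensity   = c(1, 2, -1, 3))
fake_process <- function(d) {
  if (any(d$Intensity < 0)) stop("negative intensity")
  invisible(d)
}
failing <- find_failing_proteins(toy, fake_process)
print(names(failing))
```

With the real data the call would presumably be `find_failing_proteins(ms, MSstats::dataProcess)`, where `ms` is the long-format input. Because each protein is processed on its own, this may take a while on several thousand proteins.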

Xinle Tan

Feb 13, 2024, 9:41:13 AM
to MSstats
Hi Mateusz,

I found one protein that reproduces the error; the file is attached here.

I would greatly appreciate it if you could find out what is happening.
A0A8V1ABS2.csv

Mateusz Staniak

Feb 14, 2024, 6:34:56 AM
to MSstats
Hi, the problem is most likely due to repeated rows, for example:

    ...1     X ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
1: 93193 93192  A0A8V1ABS2    VKQLPLVKPYLR               2          y8             1                L
2: 93194 93193  A0A8V1ABS2    VKQLPLVKPYLR               2          y8             1                L
   Condition BioReplicate Run Intensity
1:         A          A_1   1  10165.74
2:         A          A_1   1  10165.74

Did you run a converter before summarization? I'm curious how exactly these rows avoided the aggregation step. As a solution, please either remove the first two columns and use unique() on the resulting table, or aggregate the data per protein and feature.
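
Both suggestions can be sketched in base R as follows. This is only an illustration under stated assumptions: the column names in `key_cols` are the long-format names that appear in this thread, the toy table (including the y9 row and its intensity) is made up for the example, and taking the maximum intensity per feature and run is one possible aggregation choice rather than a confirmed MSstats default. Note that calling duplicated() on the full table does not flag such rows as long as the index columns differ.

```r
# Columns that identify one measurement of one feature in one run.
# Assumption: these are the long-format column names shown in this thread.
key_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
              "FragmentIon", "ProductCharge", "IsotopeLabelType",
              "Condition", "BioReplicate", "Run")

# Toy data mimicking the repeated rows: the index column X differs, so
# duplicated() on the whole table would not flag the second row.
ms <- data.frame(
  X                = c(93192, 93193, 93194),
  ProteinName      = "A0A8V1ABS2",
  PeptideSequence  = "VKQLPLVKPYLR",
  PrecursorCharge  = 2,
  FragmentIon      = c("y8", "y8", "y9"),
  ProductCharge    = 1,
  IsotopeLabelType = "L",
  Condition        = "A",
  BioReplicate     = "A_1",
  Run              = 1,
  Intensity        = c(10165.74, 10165.74, 2044.10),  # toy values
  stringsAsFactors = FALSE
)

# Option 1: drop the index column and keep the first row for each key.
ms_unique <- ms[!duplicated(ms[, key_cols]), c(key_cols, "Intensity")]

# Option 2: aggregate per key, here taking the maximum intensity.
# aggregate() with a formula drops rows whose Intensity is NA by default.
ms_agg <- aggregate(Intensity ~ ., data = ms[, c(key_cols, "Intensity")],
                    FUN = max)
```

Option 1 is enough when the repeated rows carry identical intensities, as in the rows quoted above. Option 2 also resolves repeated keys whose intensities differ, at the cost of choosing a summary function.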

Kind regards,
Mateusz

Xinle Tan

Feb 14, 2024, 7:03:42 AM
to MSstats
Hi Mateusz,

No, I did not run a converter. We used PeakView for peptide quantitation, and I cannot find a converter function from PeakView output to MSstats format, unless I have missed it?

Therefore I wrote a script myself to convert the wide format to long format.

Following your suggestion, I removed the duplicate rows and ran dataProcess again, protein by protein. But this particular protein (A0A8V1ABS2) still ran into the same error. Below is my code:

library(MSstats) #imports msstats
ms <- read.delim('Norman_jejunum_reformat_20240214.txt',sep = '\t')
head(ms) #shows column names
ms = ms[!duplicated(ms),]
test_list = unique(ms$ProteinName)
for (p in test_list) {
  ms1 <- ms[ms$ProteinName == p,]
  print(p)
  QuantData <- MSstats::dataProcess(ms1)
}

The first few proteins were fine and then the error popped up. Here are the last few lines of the output:

[1] "A0A8V1ABS2"
INFO  [2024-02-14 21:55:44] ** Features with one or two measurements across runs are removed.
INFO  [2024-02-14 21:55:44] ** Fractionation handled.
INFO  [2024-02-14 21:55:44] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO  [2024-02-14 21:55:44] ** Log2 intensities under cutoff = 7.8211  were considered as censored missing values.
INFO  [2024-02-14 21:55:44] ** Log2 intensities = NA were considered as censored missing values.
INFO  [2024-02-14 21:55:44] ** Use all features that the dataset originally has.
INFO  [2024-02-14 21:55:44]
 # proteins: 1
 # peptides per protein: 59-59
 # features per peptide: 4-6
INFO  [2024-02-14 21:55:44]
                     A  B
             # runs 10 10
    # bioreplicates 10 10
 # tech. replicates  1  1
INFO  [2024-02-14 21:55:44] Some features are completely missing in at least one condition:  
 ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y10_1,
 ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y12_1,
 ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y16_2,
 ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y7_1,
 ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y8_1 ...
INFO  [2024-02-14 21:55:44] The following runs have more than 75% missing values: 11
INFO  [2024-02-14 21:55:44]  == Start the summarization per subplot...
  |                                                                                                                                               |   0%Aggregate function missing, defaulting to 'length'

<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
INFO  [2024-02-14 21:55:54]  == Summarization is done.

Error: Elements listed in `by` must be valid column names in x and y
In addition: Warning message:
Input data.table 'y' has no columns. 

Actually, we have MSstats 2.4 installed on another computer, and when we ran this file in that version it worked fine.

Mateusz Staniak

Feb 14, 2024, 11:21:23 AM
to MSstats
I recommend using the latest version of MSstats (particularly because 2.4 uses a different statistical approach, IIRC). Please look into the documentation for the MSstatsPreprocess and MSstatsBalancedDesign functions from the MSstatsConvert package. These functions ensure that the data is in the right format for the summarization step.


Kind regards,
Mateusz