MSstats error

komw...@gmail.com

May 15, 2023, 12:43:46 PM

to MSstats

Hello MSstats team

I have a quick question about the MSstats normalization process.

I tried to run median equalization in dataProcess, before the group comparison step, and it gave me this error:

> srm.equalmed <- dataProcess(SRMRawData, normalization = 'equalizeMedians')
INFO [2023-05-16 01:04:03] ** Features with one or two measurements across runs are removed.
INFO [2023-05-16 01:04:03] ** Fractionation handled.
INFO [2023-05-16 01:04:03] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2023-05-16 01:04:03] ** Log2 intensities under cutoff = 5.4378 were considered as censored missing values.
INFO [2023-05-16 01:04:03] ** Log2 intensities = NA were considered as censored missing values.
INFO [2023-05-16 01:04:03] ** Use all features that the dataset originally has.
INFO [2023-05-16 01:04:03]
# proteins: 65
# peptides per protein: 1-1
# features per peptide: 3-5
INFO [2023-05-16 01:04:03]
Disease Healthy
# runs 51 36
# bioreplicates 51 36
# tech. replicates 1 1
INFO [2023-05-16 01:04:03] Some features are completely missing in at least one condition:
EHVAHLLFLR_3_y3_1,
NA ...
INFO [2023-05-16 01:04:03] == Start the summarization per subplot...
|==================================== | 28%Aggregate function missing, defaulting to 'length'
<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
INFO [2023-05-16 01:04:03] == Summarization is done.
Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], :
Elements listed in `by` must be valid column names in x and y
In addition: Warning messages:
1: In survreg.fit(X, Y, weights, offset, init = init, controlvals = control, :
Ran out of iterations and did not converge
2: In merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE], :
You are trying to join data.tables where 'y' argument is 0 columns data.table.

How can I solve it?

I have also attached my Skyline output data:

MSstatsInput5.csv

Mateusz Staniak

May 15, 2023, 12:56:23 PM

to MSstats

Hi,

I wasn't able to reproduce the issue. Did you apply the converter function before running dataProcess?

My code:

sli = readr::read_csv("MSstatsInput5.csv")
sli_conv = MSstats::SkylinetoMSstatsFormat(sli)
unique(sli_conv$IsotopeLabelType)

sli_dp = MSstats::dataProcess(sli_conv) # no error

sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1

locale:
[1] LC_CTYPE=pl_PL.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=pl_PL.UTF-8 LC_MONETARY=pl_PL.UTF-8
[6] LC_MESSAGES=pl_PL.UTF-8 LC_PAPER=pl_PL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] MSstats_4.6.5 gtools_3.9.4 tidyselect_1.2.0 splines_4.2.1 lattice_0.20-45 log4r_0.4.3
[7] colorspace_2.1-0 vctrs_0.6.1 generics_0.1.3 utf8_1.2.3 survival_3.4-0 marray_1.76.0
[13] rlang_1.1.0 pillar_1.9.0 nloptr_2.0.3 glue_1.6.2 withr_2.5.0 bit64_4.0.5
[19] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.3 caTools_1.18.2 tzdb_0.3.0 parallel_4.2.1
[25] fansi_1.0.4 preprocessCore_1.60.2 Rcpp_1.0.10 KernSmooth_2.23-20 readr_2.1.3 scales_1.2.1
[31] backports_1.4.1 checkmate_2.1.0 limma_3.54.1 vroom_1.6.1 bit_4.0.5 lme4_1.1-31
[37] gplots_3.1.3 ggplot2_3.4.2 hms_1.1.2 stringi_1.7.12 dplyr_1.1.1 ggrepel_0.9.2
[43] grid_4.2.1 MSstatsConvert_1.9.3 cli_3.6.1 tools_4.2.1 bitops_1.0-7 magrittr_2.0.3
[49] tibble_3.2.1 crayon_1.5.2 pkgconfig_2.0.3 ellipsis_0.3.2 MASS_7.3-58.1 Matrix_1.5-4
[55] data.table_1.14.8 minqa_1.2.5 rstudioapi_0.14 R6_2.5.1 boot_1.3-28 nlme_3.1-159
[61] compiler_4.2.1

Kind regards,

Mateusz

Xinle Tan

Feb 13, 2024, 8:04:58 AM

to MSstats

Hi Mateusz:

I ran into the same error as the original poster with my own dataset. Then I used the original poster's input file and ran into exactly the same error.

My code:
library(MSstats) #imports msstats
ms <- read.csv('MSstatsInput5.csv', sep=",")
head(ms) #shows column names
QuantData <- MSstats::dataProcess(ms) #processes the data

> ms <- read.csv('MSstatsInput5.csv', sep=",")
> head(ms) #shows column names
ProteinName PeptideSequence PeptideModifiedSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Run Intensity Condition
1 sp|P00748|FA12_HUMAN TEQAAVAR TEQAAVAR 2 b3 1 heavy 4 37315 Healthy
2 sp|Q06033|ITIH3_HUMAN DYIFGNYIER DYIFGNYIER 2 b3 1 heavy 4 43777 Healthy
3 sp|P02743|SAMP_HUMAN VFVFPR VFVFPR 2 b3 1 heavy 4 52653 Healthy
4 sp|P00740|FA9_HUMAN NCELDVTCNIK NC[+57]ELDVTC[+57]NIK 2 b3 1 heavy 4 58457 Healthy
5 sp|P00488|F13A_HUMAN GTYIPVPIVSELQSGK GTYIPVPIVSELQSGK 2 b3 1 heavy 4 58919 Healthy
6 sp|P22891|PROZ_HUMAN GLLSGWAR GLLSGWAR 2 b3 1 heavy 4 79872 Healthy
BioReplicate
1 4
2 4
3 4
4 4
5 4
6 4
> QuantData <- MSstats::dataProcess(ms) #processes the data
INFO [2024-02-13 23:02:56] ** Features with one or two measurements across runs are removed.
INFO [2024-02-13 23:02:56] ** Fractionation handled.
INFO [2024-02-13 23:02:56] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2024-02-13 23:02:57] ** Log2 intensities under cutoff = 5.4378 were considered as censored missing values.
INFO [2024-02-13 23:02:57] ** Log2 intensities = NA were considered as censored missing values.
INFO [2024-02-13 23:02:57] ** Use all features that the dataset originally has.
INFO [2024-02-13 23:02:57]

# proteins: 65
# peptides per protein: 1-1
# features per peptide: 3-5

INFO [2024-02-13 23:02:57]

Disease Healthy
# runs 51 36
# bioreplicates 51 36
# tech. replicates 1 1

INFO [2024-02-13 23:02:57] Some features are completely missing in at least one condition:
EHVAHLLFLR_3_y3_1,
NA ...
INFO [2024-02-13 23:02:57] == Start the summarization per subplot...
|======================================== | 28%Aggregate function missing, defaulting to 'length'

INFO [2024-02-13 23:02:57] == Summarization is done.
Error: Elements listed in `by` must be valid column names in x and y

In addition: Warning messages:
1: In survreg.fit(X, Y, weights, offset, init = init, controlvals = control, :
Ran out of iterations and did not converge

2: Input data.table 'y' has no columns.

Below is my R version:

> R.version
_
platform x86_64-apple-darwin20
arch x86_64
os darwin20
system x86_64, darwin20
status
major 4
minor 3.2
year 2023
month 10
day 31
svn rev 85441
language R
version.string R version 4.3.2 (2023-10-31)
nickname Eye Holes

Has anyone solved this problem?

Mateusz Staniak

Feb 13, 2024, 8:16:15 AM

to MSstats

Hi,

Can you share a small subset of the data that reproduces this issue? The data can be anonymized by changing the protein, run, and condition labels. As long as the error is reproduced, one protein might be enough.

Kind regards,
Mateusz

Xinle Tan

Feb 13, 2024, 9:12:32 AM

to MSstats

Hi Mateusz:

I tried subsetting only a few proteins to represent the dataset, and the error is not there anymore!

Apparently something is going on with a few proteins. How would I know which ones are the problem? I have 5000 proteins and the dataset is 260 MB.
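
One way to narrow down which proteins trigger the error, sketched here under stated assumptions: run the processing step on each protein separately and catch the error instead of stopping at the first failure. The helper name `find_failing_proteins` and its arguments are hypothetical and not part of MSstats; `process_fun` stands in for whatever call fails on the full data (for example `MSstats::dataProcess`), and the input is assumed to be a plain data.frame with a `ProteinName` column.

```r
# Hypothetical helper (not an MSstats function): apply `process_fun` to each
# protein's rows and collect the error message for every protein that fails.
find_failing_proteins <- function(data, process_fun, protein_col = "ProteinName") {
  proteins <- as.character(unique(data[[protein_col]]))
  messages <- vapply(proteins, function(p) {
    rows_p <- data[data[[protein_col]] == p, , drop = FALSE]
    tryCatch({
      process_fun(rows_p)
      NA_character_                      # NA marks a protein that ran cleanly
    }, error = function(e) conditionMessage(e))
  }, character(1))
  failed <- messages[!is.na(messages)]
  data.frame(ProteinName = as.character(names(failed)),
             Error = unname(failed),
             stringsAsFactors = FALSE)
}

# Possible usage, assuming `ms` is the long-format table read from file:
# failures <- find_failing_proteins(ms, MSstats::dataProcess)
# failures$ProteinName
```

Processing 5000 proteins one at a time can be slow, and a protein that fails alone may behave differently inside the full dataset, so the result is a starting point for inspection rather than a diagnosis.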

Xinle Tan

Feb 13, 2024, 9:41:13 AM

to MSstats

Hi Mateusz:

I found one protein that reproduces the error; the file is attached below.

I would greatly appreciate it if you could find out what is happening.

A0A8V1ABS2.csv

Mateusz Staniak

Feb 14, 2024, 6:34:56 AM

to MSstats

Hi, the problem is most likely due to repeated rows, for example:

   ...1 X ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
1: 93193 93192 A0A8V1ABS2 VKQLPLVKPYLR 2 y8 1 L
2: 93194 93193 A0A8V1ABS2 VKQLPLVKPYLR 2 y8 1 L
   Condition BioReplicate Run Intensity
1: A A_1 1 10165.74
2: A A_1 1 10165.74

Did you run a converter before summarization? I'm curious how exactly these rows avoided the aggregation step. As a solution, please either remove the first two columns and use unique() on the resulting table, or aggregate the data per protein and feature.

Kind regards,

Mateusz
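
The two options suggested above can be sketched in base R as follows. In the example rows, the two records differ only in the leading index columns (`...1` and `X`), so de-duplicating on all columns would keep both; the sketch therefore compares rows on the measurement columns only. The function names are hypothetical, the key columns follow the column names shown in this thread, the input is assumed to be a plain data.frame (not a data.table), and taking the maximum intensity for repeated measurements is an assumption, not something the thread prescribes.

```r
# Columns assumed to identify one measurement (one feature in one run).
key_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
              "FragmentIon", "ProductCharge", "IsotopeLabelType",
              "Condition", "BioReplicate", "Run")

# Return every row whose key occurs more than once (all copies, for inspection).
find_duplicate_measurements <- function(data, keys = key_cols) {
  data <- as.data.frame(data)
  keys <- intersect(keys, names(data))
  dup <- duplicated(data[keys]) | duplicated(data[keys], fromLast = TRUE)
  data[dup, , drop = FALSE]
}

# Option 1: drop rows that repeat the same key AND the same intensity,
# ignoring any other columns such as row indices.
drop_exact_duplicates <- function(data, keys = key_cols) {
  data <- as.data.frame(data)
  cols <- intersect(c(keys, "Intensity"), names(data))
  data[!duplicated(data[cols]), , drop = FALSE]
}

# Option 2: collapse to one row per key, keeping the maximum intensity
# (assumption). Columns outside the key and Intensity are dropped, and rows
# with NA in a key column are omitted by aggregate().
collapse_duplicates <- function(data, keys = key_cols) {
  data <- as.data.frame(data)
  keys <- intersect(keys, names(data))
  max_or_na <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
  aggregate(data["Intensity"], by = data[keys], FUN = max_or_na)
}
```

Running `find_duplicate_measurements()` first shows whether the repeats are exact copies (option 1 is enough) or carry different intensities (some aggregation rule has to be chosen).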

Xinle Tan

Feb 14, 2024, 7:03:42 AM

to MSstats

Hi Mateusz:

No, I did not run a converter. We used PeakView for peptide quantitation, and I cannot find a converter function from PeakView output to MSstats, unless I have missed it?

Therefore I wrote a script myself to convert the wide format to long format.

Following your suggestion, I removed the duplicate rows and ran dataProcess again, protein by protein. But it looks like this particular protein (A0A8V1ABS2) still runs into the same error. Below is my code:

library(MSstats) # imports MSstats
ms <- read.delim('Norman_jejunum_reformat_20240214.txt', sep = '\t')
head(ms) # shows column names
ms <- ms[!duplicated(ms), ] # drops rows that are identical in every column
test_list <- unique(ms$ProteinName)
for (p in test_list) {
  ms1 <- ms[ms$ProteinName == p, ] # rows for one protein
  print(p)
  QuantData <- MSstats::dataProcess(ms1)
}

The first few proteins were fine and then the error popped up. Here are the last few rows of the output:

[1] "A0A8V1ABS2"
INFO [2024-02-14 21:55:44] ** Features with one or two measurements across runs are removed.
INFO [2024-02-14 21:55:44] ** Fractionation handled.
INFO [2024-02-14 21:55:44] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2024-02-14 21:55:44] ** Log2 intensities under cutoff = 7.8211 were considered as censored missing values.
INFO [2024-02-14 21:55:44] ** Log2 intensities = NA were considered as censored missing values.
INFO [2024-02-14 21:55:44] ** Use all features that the dataset originally has.
INFO [2024-02-14 21:55:44]
# proteins: 1
# peptides per protein: 59-59
# features per peptide: 4-6
INFO [2024-02-14 21:55:44]
A B
# runs 10 10
# bioreplicates 10 10

# tech. replicates 1 1

INFO [2024-02-14 21:55:44] Some features are completely missing in at least one condition:
ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y10_1,
ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y12_1,
ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y16_2,
ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y7_1,
ANVPN[Dea]KVIQC[PPa]FAETGQVQK_3_y8_1 ...
INFO [2024-02-14 21:55:44] The following runs have more than 75% missing values: 11
INFO [2024-02-14 21:55:44] == Start the summarization per subplot...
| | 0%Aggregate function missing, defaulting to 'length'

INFO [2024-02-14 21:55:54] == Summarization is done.

Error: Elements listed in `by` must be valid column names in x and y

In addition: Warning message:

Input data.table 'y' has no columns.

Actually, we have MSstats 2.4 installed on another computer; we ran this file with that version and it worked fine.

Mateusz Staniak

Feb 14, 2024, 11:21:23 AM

to MSstats

I recommend using the latest version of MSstats (particularly because 2.4 uses a different statistical approach, IIRC). Please look into the documentation for the MSstatsPreprocess and MSstatsBalancedDesign functions from the MSstatsConvert package. These functions ensure that the data is in the right format for the summarization step.

Kind regards,
Mateusz
