dataprocess error with SRM data

esin sahin

Jul 29, 2024, 8:46:15 AM

to MSstats

Hello,
I am new to this package and to R in general, and I have been trying to use the MSstats package on SRM data. I exported the results from Skyline in CSV format and manually added the "Condition", "BioReplicate", and "Run" columns in Excel; they are set to 1 for every row. However, when I call the dataProcess function with this code: dataProcess(unique_data, normalization = 'equalizeMedians'),
I get this error: 8%<simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>
INFO [2024-07-29 15:06:53] == Summarization is done.
Elements listed in `by` must be valid column names in x and y
Input data.table 'y' has no columns.

I also reran the code above on the data after deduplicating it with the unique() function, but the same error persists.

Lastly, when I try to use the SkylinetoMSstatsFormat function before running dataProcess, I get this error: Can't assign 4 names to a 0-column data.table

My setup: R version 4.4.1 (2024-06-14 ucrt), platform x86_64-w64-mingw32/x64, running under Windows 11 x64 (build 22631), matrix products: default, MSstats_4.12.0.
Thank you very much in advance.

Anthony Wu

Jul 29, 2024, 11:12:53 AM

to MSstats

Hi,

Could you share the Skyline input file that you used, along with the "condition", "bioreplicate", and "run" columns that you added? That would let us better diagnose the issue with the dataset on our side.

Thanks,

Tony

esin sahin

Jul 30, 2024, 8:38:57 AM

to MSstats

Hello,

I have attached below the Skyline file and the CSV file to which I added the "condition", "bioreplicate", and "run" columns.

I also get this error: 8%Aggregate function missing, defaulting to 'length'
<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
INFO [2024-07-30 15:34:35] == Summarization is done.

Elements listed in `by` must be valid column names in x and y
Input data.table 'y' has no columns.

Thank you very much for your reply,

Esin

On Monday, July 29, 2024 at 18:12:53 UTC+3, wu.a...@husky.neu.edu wrote:

msstatskylineforum.csv

skylinemstattrial.sky

Anthony Wu

Aug 5, 2024, 10:47:10 AM

to MSstats

Hi,

What is the experimental design of your experiment?

I looked at the CSV file, and I only see one bioreplicate and one condition. To perform differential abundance analysis with MSstats, you would need at least 2 conditions, and at least 2 bioreplicates per condition.
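
As a rough illustration, this requirement can be checked in base R before calling dataProcess. This is only a sketch on made-up annotation values: count_bioreps is a hypothetical helper, not part of MSstats, and the column names simply follow the "Condition" and "BioReplicate" columns mentioned earlier in the thread.

```r
# Toy annotation; all values are invented for illustration.
annot <- data.frame(
  Condition    = c("A", "A", "B", "B", "C"),
  BioReplicate = c(1, 2, 3, 4, 5),
  Run          = c("r1", "r2", "r3", "r4", "r5")
)

# Hypothetical helper: number of distinct bioreplicates within each condition.
count_bioreps <- function(d) {
  tapply(d$BioReplicate, d$Condition, function(x) length(unique(x)))
}

n_bioreps        <- count_bioreps(annot)
n_conditions     <- length(n_bioreps)
under_replicated <- names(n_bioreps)[n_bioreps < 2]

print(n_bioreps)         # bioreplicates per condition
print(under_replicated)  # conditions with fewer than 2 bioreplicates
```

In this toy table, condition "C" would be flagged because it has a single bioreplicate.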

Thanks,

Tony

esin sahin

Aug 21, 2024, 7:08:14 AM

to MSstats

Hello,

I apologize for the delayed response. We have edited the data, and there are now 3 conditions with 2 bioreplicates each. However, we do not actually have bioreplicates; we simply reused the same data as bioreplicate 2 so that MSstats would run, as you can see in the CSV file attached below. With this updated data, the dataProcess function gave another error:

"Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 36 rows; more than 12 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."

I tried including allow.cartesian=TRUE in the call: dataProcess(data, normalization = 'quantile', allow.cartesian=TRUE), but R reported an "unused argument" error.

I was wondering what the reason for these errors could be and how they can be solved. Thank you very much for your time.

Best regards,

Esin

On Monday, August 5, 2024 at 17:47:10 UTC+3, wu.a...@husky.neu.edu wrote:

msstattrialdata.csv

Anthony Wu

Aug 27, 2024, 11:29:44 AM

to MSstats

Hi Esin,

The issue with your dataset is that every measurement belongs to the same run, which should not be the case in a label free experiment. I adjusted the "Run" column of your dataset such that each unique bioreplicate + condition pair is associated with a unique run ID.

One additional note - I saw in your dataset that the bioreplicate column contained values corresponding to multiple conditions, e.g. bioreplicate "1" corresponded to conditions A-1, A-2, and A-3. This annotation represents a repeated measures design, where the same bioreplicate is measured across conditions. If you're not using a repeated measures design, then each value in the bioreplicate column should only correspond to 1 condition. I attached an XLSX file illustrating this point.
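
Both points can be sketched in base R. The toy annotation below mimics the situation described (conditions A-1, A-2, and A-3, all rows in run 1); n_distinct_per is a hypothetical helper, and the renumbering at the end is just one possible way to give each condition + bioreplicate pair its own run ID, not necessarily how the attached file was produced.

```r
# Toy annotation resembling the problem: every row is assigned to run 1.
annot <- data.frame(
  Condition    = c("A-1", "A-2", "A-3", "A-1", "A-2", "A-3"),
  BioReplicate = c(1, 1, 1, 2, 2, 2),
  Run          = c(1, 1, 1, 1, 1, 1)
)

# Hypothetical helper: for each value of `key`, count distinct values of `other`.
n_distinct_per <- function(d, key, other) {
  tapply(as.character(d[[other]]), d[[key]], function(x) length(unique(x)))
}

# Label each condition + bioreplicate pair.
annot$Pair <- paste(annot$Condition, annot$BioReplicate, sep = "_")

# Number of condition + bioreplicate pairs sharing each run (ideally 1).
pairs_per_run <- n_distinct_per(annot, "Run", "Pair")

# Number of conditions each bioreplicate appears in
# (more than 1 implies a repeated measures annotation).
conditions_per_biorep <- n_distinct_per(annot, "BioReplicate", "Condition")

# One possible fix: a distinct run ID per condition + bioreplicate pair.
annot$RunFixed <- as.integer(factor(annot$Pair, levels = unique(annot$Pair)))

print(pairs_per_run)
print(conditions_per_biorep)
print(annot)
```

If the design is not repeated measures, the BioReplicate values would also need to be relabeled so that each one occurs under a single condition.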

Thanks,

Tony

msstatstrialdata.csv

annotation example.xlsx

esin sahin

Sep 10, 2024, 6:19:28 AM

to MSstats

Hello,

Thank you very much for your response and help. We are now testing another dataset in which each Run is unique and there are two bioreplicates for each condition. Running dataProcess through the MSstats tool in Skyline gave successful results with this data. However, when I tried to do the same in R with this code: srm.equalmed <- dataProcess(data, normalization = 'equalizeMedians'),

I obtained this error:

"Error in .handleFractionsLF(input) :
** It is hard to find the same fractionation across samples, due to lots of overlapped features between fractionations. Please add Fraction column in input."

There is normally no fractionation in this data. Nevertheless, I tried adding a Fraction column set to 1 for all rows to see whether it would resolve the error, but the same error occurred. I would be very grateful for your help with this error. I have attached the data below.

Lastly, I wanted to ask a general question about MSstats: are the normalization options in this package heavy to light or light to heavy?

Thank you very much in advance.

Best regards,

On Tuesday, August 27, 2024 at 18:29:44 UTC+3, wu.a...@husky.neu.edu wrote:

msstat0909forum (1).csv

Anthony Wu

Sep 13, 2024, 3:38:02 PM

to MSstats

Hi Esin,

Your data looks a little strange: why is there a unique run ID for each row in the data? Generally, in label-free experiments, I'd expect multiple proteins to be measured in the same MS run. Because of this setup, our software assumes that you performed fractionation in your experiment.
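
A quick way to spot this pattern is to count the rows per run. The sketch below uses invented values, and the "ProteinName" and "Intensity" column names are assumptions about the input table rather than something taken from the attached file.

```r
# Toy input where every row has its own run ID (the pattern described above).
raw <- data.frame(
  ProteinName = c("P1", "P1", "P2", "P2"),
  Run         = c(1, 2, 3, 4),
  Intensity   = c(1500, 1700, 900, 1100)
)

rows_per_run   <- table(raw$Run)
all_singletons <- all(rows_per_run == 1)

# For comparison: the same measurements grouped into two runs.
expected <- transform(raw, Run = c(1, 2, 1, 2))
rows_per_run_expected <- table(expected$Run)

print(rows_per_run)           # one row per run
print(all_singletons)
print(rows_per_run_expected)  # several features per run
```

If all_singletons is TRUE for a real dataset, the Run column most likely does not reflect the actual MS runs.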

Could you also elaborate what you mean by a normalization option being "light" vs "heavy"? I'm not sure what you mean there.

Thanks,

Tony

esin sahin

Sep 16, 2024, 3:57:09 AM

to MSstats

Hello,
This experiment is not label-free: heavy-labeled isotopes are used as references, and the data are from QQQ MS_MRM. Also, what I meant is: does the normalization option of MSstats take the ratio of heavy-labeled to light-labeled isotopes, or of light-labeled to heavy-labeled isotopes? Thank you for your response and help.

Best regards,

Esin

On Friday, September 13, 2024 at 22:38:02 UTC+3, antho...@gmail.com wrote:

Anthony Wu

Sep 26, 2024, 12:02:37 PM

to MSstats

Hi,

When introducing heavy-labeled isotopes, the normalization option in MSstats uses the heavy-labeled isotopes as a reference for normalizing all intensities. For example, if using the option "equalizeMedians", MSstats will take the median of intensities of heavy-labeled isotopes for each run and normalize all intensities based on those medians. Let me know if you need any clarification on what I mean here.
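
To make the idea concrete, here is a minimal base R sketch of median equalization against a heavy reference. It is not the MSstats implementation: the log2 scale, the choice of the median of the run medians as the common target, the "H"/"L" label values, and the column names are all assumptions made for illustration, and the intensities are invented.

```r
# Toy data: two runs, each with heavy (H) reference and light (L) intensities.
srm <- data.frame(
  Run              = rep(c("run1", "run2"), each = 4),
  IsotopeLabelType = rep(c("H", "H", "L", "L"), times = 2),
  Intensity        = c(1000, 4000, 500, 800, 2000, 8000, 1200, 1500)
)
srm$log2Int <- log2(srm$Intensity)

# Median of the heavy (reference) intensities within each run.
heavy      <- srm[srm$IsotopeLabelType == "H", ]
run_median <- tapply(heavy$log2Int, heavy$Run, median)

# Shift every run so that its heavy median matches a common target.
target <- median(run_median)
shift  <- target - run_median
srm$log2IntNorm <- srm$log2Int + as.vector(shift[as.character(srm$Run)])

# After the shift, the heavy medians agree across runs.
heavy_norm     <- srm[srm$IsotopeLabelType == "H", ]
median_by_run  <- tapply(heavy_norm$log2IntNorm, heavy_norm$Run, median)
print(shift)
print(median_by_run)
```

Note that the same per-run shift is applied to both heavy and light intensities, so the reference only determines the size of the correction for each run.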

I'm still unsure why your dataset is not processing - I believe it's because each row (i.e. fragment identification/quantification) corresponds to a unique run which I don't think is realistic (e.g. did you perform 222 unique MS runs?). If you want an example SRM dataset to look at, you can use the one I attached.

Thanks,

Tony

SRMRawData.csv

esin sahin

Oct 25, 2024, 9:47:40 AM

to MSstats

Hello,
We have tried another dataset with 3 conditions, unique bioreplicate values, and a Run column that, unlike the one in the last problematic dataset, is not just a sequence of increasing values. However, we have encountered an error that I asked you about before:

"Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 36 rows; more than 18 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."

I tried running the following code, but it did not work:

dataProcess(data, normalization = 'equalizeMedians')
allow.cartesian=TRUE
dataProcessPlots(srm.equalmed, type='QCplot', address=FALSE)

I also looked for duplicated values in Excel, as the error suggests, but could not find any. I have attached the data below. Thank you very much for your time and help.
Best regards,
Esin Şahin

On Thursday, September 26, 2024 at 19:02:37 UTC+3, wu.a...@husky.neu.edu wrote:

MSStat-2510forum.csv

Anthony Wu

Oct 31, 2024, 9:27:15 AM

to MSstats

Hi,

Two issues with your dataset:

1. Each bioreplicate should correspond to a unique run ID. In your dataset, multiple bioreplicates correspond to a single MS run, which doesn't make sense for an SRM experiment.

2. The intensity values are formatted improperly. E.g. I see an intensity value of 6.515.416.875, which is not a number. I recommend following the American decimal number format (e.g. [whole number].[decimal])
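
For the second point, a small base R sketch like the following can flag intensity values that do not parse as numbers. The example strings are invented apart from the value quoted above, and the check deliberately does not try to repair them, since it is not clear from the string alone where the decimal point belongs.

```r
# Intensity values as they might appear when read in as text.
intensity_raw <- c("6.515.416.875", "1234.5", "98765")

# as.numeric() returns NA for strings that are not valid numbers.
parsed <- suppressWarnings(as.numeric(intensity_raw))
bad    <- intensity_raw[is.na(parsed)]

# Values containing more than one "." are a typical sign of
# thousands separators or a locale mismatch during export.
multi_dot <- grepl("\\..*\\.", intensity_raw)

print(bad)
print(multi_dot)
```

Any values reported in bad would need to be corrected at the export step (for example in the spreadsheet's number format settings) before running dataProcess.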

Tony
