Problem converting label-free MaxQuant data into MSstatsPTM format

Sofia Farkona

Sep 10, 2024, 4:58:02 PM
to MSstats

Hello all. My colleague and I are currently analyzing our phosphoproteomics LC-MS/MS data. I have been trying, and failing miserably, to convert these label-free MaxQuant data into MSstatsPTM format. I followed the instructions in the MaxQtoMSstatsPTMFormat R documentation (also available at new.bioconductor.org/packages/release/bioc/vignettes/MSstatsPTM/inst/doc/MSstatsPTM_LabelFree_Workflow.R and in "MSstatsPTM: Statistical Characterization of Post-translational Modifications" on bioconductor.org).

I have the "evidence.txt" file from MaxQuant, I made the annotation file as explained, and I have the proteinGroups.txt file and the Phospho(STY)Sites.txt file (although the last one is not required), but the conversion keeps failing. I followed the code from the example below.

Code from the MaxQtoMSstatsPTMFormat R documentation:

msstats_format_lf = MaxQtoMSstatsPTMFormat(evidence=maxq_lf_evidence,
                                     annotation=maxq_lf_annotation,
                                     fasta=system.file("extdata",
                                                       "maxq_lf_fasta.fasta",
                                                       package="MSstatsPTM"),
                                     fasta_protein_name="uniprot_ac",
                                     mod_id="\\(Phospho \\(STY\\)\\)",
                                     use_unmod_peptides=TRUE,
                                     labeling_type = "LF",
                                     which_proteinid_ptm = "Proteins")


When I tried to run, I got the following error in the console: "Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :

Join results in 846403 rows; more than 377678 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."

From my understanding, the error is somehow related to duplicate protein IDs in the evidence file? But these duplicated protein IDs are there because of the way data are reported in the evidence file. For example, a peptide sequence (reported in the Sequence column) will be reported multiple times if it is identified in different experiments (column Experiment) or in different raw files (column Raw.file). In that case the respective protein ID will also be reported multiple times. Am I wrong? I have run MaxQuant multiple times, and this is how the evidence file always looks.

I tried to run with allow.cartesian=TRUE; it did not work either.

Any help will be appreciated

Best,

S


Sofia Farkona

Sep 12, 2024, 8:42:25 PM
to MSstats
Hello all, I thought I would reach out again and provide some extra information. I am attaching smaller versions of my data here: the first 100 rows of my relevant csv files.

Extra information:
> packageVersion("MSstatsPTM")
[1] ‘2.6.0’
> R.version.string
[1] "R version 4.4.1 (2024-06-14 ucrt)"
  
I am really hoping for some feedback.

Best,
S         

annotation100.csv
PhosSTY_rand.csv
proteins100.csv
evidence100.csv

Anthony Wu

Sep 13, 2024, 3:13:29 PM
to MSstats
Hi,

I took a look and the FASTA file is the important piece that seems to be missing here.  To clarify, this FASTA file is obtained by querying UniProt with all of the protein IDs present in your evidence file.  The FASTA file is important because MaxQuant's evidence file does not report the position of the modified amino acid relative to the whole protein sequence (rather, it reports the position relative to the reported peptide).  The FASTA file allows MSstatsPTM to determine the absolute location of the modification within the whole protein.
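
To illustrate the idea of mapping a peptide-relative site to a protein-relative site, here is a minimal base-R sketch. The protein sequence, peptide, and site position are made up for illustration, and this is not MSstatsPTM's internal implementation; it only shows the arithmetic that a full protein sequence makes possible.

```r
# Toy protein sequence and peptide (both invented for this example).
protein_seq <- "MKTAYIAKQRSPEK"
peptide <- "AKQRSPEK"
site_in_peptide <- 5L  # position of the modified S within the peptide

# Locate the peptide in the protein (1-based start index, -1 if absent).
peptide_start <- as.integer(regexpr(peptide, protein_seq, fixed = TRUE))

# Shift the peptide-relative position to a protein-relative position.
site_in_protein <- peptide_start + site_in_peptide - 1L
residue <- substr(protein_seq, site_in_protein, site_in_protein)
site_label <- paste0(residue, site_in_protein)
print(site_label)
```

Without the full protein sequence, the peptide start position is unknown, so the site label cannot be computed from the evidence file alone.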

Let me know if you have any additional questions / bugs.  I'll also look to update the documentation / vignettes to ensure it's more clear why the FASTA file is needed and how to get it.

Thanks,
Tony

Sofia Farkona

Sep 13, 2024, 8:29:46 PM
to MSstats
Thanks a lot Tony for your feedback. I will try this and update. 

Best,
S

Sofia Farkona

Sep 17, 2024, 6:20:08 PM
to MSstats
Hello Tony, I am back! Unfortunately, the error persists. This will be a long email, but it explains what I did.
My actions after your last email:

  • I extracted the unique protein IDs from my evidence.csv file. For this task I used "evidence$Leading.razor.protein" column in this way:
unique_protein_ids <- unique(evidence$Leading.razor.protein)

  • I wrote the unique protein IDs to a text file
write.table(unique_protein_ids, file = "protein_ids.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

I submitted the list to UniProt and then retrieved 4,795 results (proteins) as a FASTA file.

  • I installed and loaded the Biostrings package. Then I loaded my FASTA file in R to check it:
> fasta_file <- readAAStringSet("E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta")

  • I inspected the first few entries:
> head(names(fasta_file)) # Protein names or IDs
[1] "sp|P55011|S12A2_HUMAN Solute carrier family 12 member 2 OS=Homo sapiens OX=9606 GN=SLC12A2 PE=1 SV=1"
[2] "sp|Q86U42|PABP2_HUMAN Polyadenylate-binding protein 2 OS=Homo sapiens OX=9606 GN=PABPN1 PE=1 SV=3"

  • I inspected the first sequence:
> fasta_file[[1]] # Show the first protein sequence
1212-letter AAString object
seq: MEPRPTAPSSGAPGLAGVGETPSAAALAAARVELPGTAVPSVPEDAAPASR...ANIIVMSLPVARKGAVSSALYMAWLEALSKDLPPILLVRGNHQSVLTFYS

  • I checked the number of sequences in the FASTA file:
> # Check how many sequences are in the FASTA file
> length(fasta_file)
[1] 4795

  • After loading MSstatsPTM again I made my annotation file:
> annotation <- data.frame(
+     Run = evidence$Raw.file,
+     BioReplicate = evidence$Raw.file,
+     Condition = evidence$Experiment,
+     Raw.file = evidence$Raw.file,
+     IsotopeLabelType = "L")

It seemed fine.

  • I tried to run the conversion to MSstatsPTM format using the code available for MaxQuant label-free data:
msstats_format_lf <- MaxQtoMSstatsPTMFormat(
    evidence = evidence,  # my evidence file
    annotation = annotation,  # my annotation file
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",  # path to my newly created FASTA file
    fasta_protein_name = "uniprot_ac",  # UniProt ID in the FASTA file
    mod_id = "\\(Phospho \\(STY\\)\\)",  # phosphorylation modification
    use_unmod_peptides = TRUE,  # include unmodified peptides
    labeling_type = "LF",  # label-free quantification
    which_proteinid_ptm = "Leading.razor.protein"  # I am using the 'Leading.razor.protein' column from the evidence file
)

Unfortunately, I got the exact same error.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 486119 rows; more than 460104 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

As in my previous email, I am attaching smaller versions of my evidence file, annotation file, PhosphoSTY file, and protein groups file (first 100 rows of each one).

When possible, could you have a look, especially at the evidence file? I am wondering whether there is something wrong with it (and consequently with the annotation file, since I made it using information from the evidence file).
In the past (before trying to explore MSstatsPTM) I never looked at the evidence file. I would first use MaxQuant and then Perseus to analyze the data, so I only worked with the proteinGroups file and the PhosphoSTY file (I don't need the evidence file for analysis with Perseus).  I am wondering whether the way I set up my analysis in MaxQuant (when I load my raw files) results in an evidence file that is not compatible with the format MSstatsPTM requires. If so, I would like to know so that I can correct it.

I would really appreciate some feedback.

Best,
Sofia
PhosSTY100.csv
proteins100.csv
annotation100.csv
evidence100.csv

Anthony Wu

Sep 17, 2024, 6:29:21 PM
to MSstats
Hi,

By any chance, could you also attach your FASTA file (if it's too big, maybe a FASTA file with the proteins in the smaller version of your dataset)?  Want to make sure I'm using the same FASTA file as you before investigating further.

I'll be taking a couple days off starting today but will investigate further once I get back.

Thanks,
Tony

Sofia Farkona

Sep 23, 2024, 2:52:08 PM
to MSstats
Hello Tony, for sure. I have attached it here.

Enjoy your time off and looking forward to discussing next steps.

Best,
S

idmapping_2024_09_17.fasta

Anthony Wu

Sep 25, 2024, 9:22:30 PM
to MSstats
Hi,

I was not able to reproduce the same error message as you with the smaller dataset (I may need a larger subset of your data), but one thing I did notice was that your annotation file has duplicate rows.  When I removed duplicate rows, the converter with the smaller dataset processed without any errors.  Could you remove the duplicate rows from your side and let me know if you encounter the same error as before?

```
annotation <- data.frame(
    Run = evidence$Raw.file,
    BioReplicate = evidence$Raw.file, 
    Condition = evidence$Experiment,
    Raw.file = evidence$Raw.file, 
    IsotopeLabelType="L"
)
annotation <- annotation[!duplicated(annotation),] # removes duplicate rows
```
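
As a rough illustration of why duplicate annotation rows matter, the sketch below uses a made-up three-row evidence table and base R's `merge` (not the data.table join that MSstatsPTM performs internally, which is an assumption on my part about where the error arises). It shows how repeated keys in the annotation multiply rows during a join, which is the kind of blow-up the "Join results in ... rows" error warns about.

```r
# Toy evidence table: two features measured in run1, one in run2.
evidence_toy <- data.frame(
  Raw.file = c("run1", "run1", "run2"),
  Intensity = c(10, 20, 30)
)

# Annotation built row-by-row from the evidence table, so run1 appears twice.
annotation_dup <- data.frame(
  Raw.file = evidence_toy$Raw.file,
  Condition = c("A", "A", "B")
)

# Each run1 evidence row matches both run1 annotation rows: 2 x 2 + 1 = 5 rows.
merged_dup <- merge(evidence_toy, annotation_dup, by = "Raw.file")

# After de-duplication there is one annotation row per run: 3 rows, as expected.
annotation_unique <- unique(annotation_dup)
merged_ok <- merge(evidence_toy, annotation_unique, by = "Raw.file")

print(c(with_duplicates = nrow(merged_dup), deduplicated = nrow(merged_ok)))
```

With a real evidence file of hundreds of thousands of rows, the same multiplication would be far larger, so de-duplicating the annotation first is a cheap check.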

Thanks,
Tony

Sofia Farkona

Sep 26, 2024, 4:19:54 PM
to MSstats
Hello Tony, I can certainly try this. However, can I ask the following question (because it is important for me to understand, to the best of my ability, what is happening and why)?
The annotation file is made using the evidence file.
From your test run, the smaller version of the evidence file I shared (evidence100) did not contribute to the problem?

What I will do is this: I will remove the duplicate rows from the annotation100.csv file (in the way you suggested). Then I will try to run the conversion using the other small files too (basically I will try to do exactly what you did). If it is successful, I will continue with the full versions. If I encounter bugs (during any of these steps), I will report them.

In any case I will update.

Best,
S

Anthony Wu

Oct 2, 2024, 11:19:22 AM
to MSstats
Hi,

That's correct, the smaller version of the evidence file you shared did not contribute to the problem.  

Tony


Sofia Farkona

Oct 3, 2024, 8:31:10 PM
to MSstats

Hello Tony, I am back with updates. I made some progress over the last few days after removing the duplicate rows from the annotation file, as you suggested.

I did this first using the smaller versions of my files (first 100 rows of the evidence and annotation files, with the annotation file always cleaned of duplicate rows). I gradually increased the size of the files and finally ran the conversion on the full files. Below you can see how this went:

msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence,
    annotation = annotationclean,
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
    fasta_protein_name = "uniprot_ac",
    mod_id = "\\(Phospho \\(STY\\)\\)",
    use_unmod_peptides = TRUE,
    labeling_type = "LF",
    which_proteinid_ptm = "Proteins"  # Use the 'Proteins' column as in the documentation, not the razor proteins
)


Question 1

I did not get any obvious errors in the console, but during the conversion I saw this message in yellow:

[1] "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications."

This is concerning, so I would like to make sure that I made the FASTA file correctly. I made the FASTA file using the "Leading.razor.protein" column from the evidence file: unique_protein_ids <- unique(evidence$Leading.razor.protein). Maybe this was NOT correct? Maybe I should have made it using evidence$Proteins? What do you think?
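
One way to see which identifiers a FASTA file lacks is to compare the IDs in the evidence column against the accessions in the FASTA headers. The sketch below is a hedged, base-R illustration with made-up inputs: the `proteins_column` values and the shortened headers are hypothetical, splitting on ";" assumes MaxQuant-style protein groups, and the comparison is not necessarily the same matching rule MSstatsPTM applies when it reports missing proteins.

```r
# Hypothetical values from a 'Proteins'-style column (semicolon-separated groups).
proteins_column <- c("P55011;P55011-3", "Q86U42", "P49006")

# Hypothetical UniProt-style FASTA headers (descriptions shortened).
fasta_headers <- c(
  "sp|P55011|S12A2_HUMAN Solute carrier family 12 member 2",
  "sp|Q86U42|PABP2_HUMAN Polyadenylate-binding protein 2"
)

# The accession is the second "|"-separated field of each header.
fasta_accessions <- vapply(
  strsplit(fasta_headers, "|", fixed = TRUE),
  function(parts) parts[2],
  character(1)
)

# Split protein groups into individual IDs and list those absent from the FASTA.
evidence_ids <- unique(unlist(strsplit(proteins_column, ";", fixed = TRUE)))
missing_ids <- setdiff(evidence_ids, fasta_accessions)
print(missing_ids)
```

In this toy case the isoform-style ID and the protein that was never submitted both show up as missing, which hints that a FASTA built from one column may not cover the IDs found in another.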

……………………..

Question 2

About the resulting msstats_format_lf_full$PROTEIN and msstats_format_lf_full$PTM data frames: I get many observations (rows) in both of them:

> nrow(msstats_format_lf_full$PROTEIN)
[1] 280944
> nrow(msstats_format_lf_full$PTM)
[1] 20112

Each data frame has 11 variables (columns), but in some of these columns I get lots of NA. In msstats_format_lf_full$PROTEIN, for some columns I only get NA (columns: PrecursorCharge, Fragmentation). Also, there are many NA in the Intensity column (could that be because of the "FASTA file missing 5225 Proteins" message mentioned earlier?).

Here is an example for msstats_format_lf_full$PROTEIN:

> msstats_format_lf_full$PROTEIN %>% head(5)
  ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType            Condition
1      P36578     AAAAAAALQAK               2          NA            NA                L     SN1_proteomeJN19
2      P36578     AAAAAAALQAK               2          NA            NA                L SN1_proteomeJN19_rep
3      P36578     AAAAAAALQAK               2          NA            NA                L      SN1_proteomeJN4
4      P36578     AAAAAAALQAK               2          NA            NA                L  SN1_proteomeJN4_rep
5      P36578     AAAAAAALQAK               2          NA            NA                L     SN2_proteomeJN19
          BioReplicate                  Run Fraction  Intensity
1     SN1_proteomeJN19     SN1_proteomeJN19        1 1124000000
2 SN1_proteomeJN19_rep SN1_proteomeJN19_rep        1  838480000
3      SN1_proteomeJN4      SN1_proteomeJN4        1  955310000
4  SN1_proteomeJN4_rep  SN1_proteomeJN4_rep        1  642690000
5     SN2_proteomeJN19     SN2_proteomeJN19        1  572790000

 

In msstats_format_lf_full$PTM I also get many NA, including in the Intensity column:

> msstats_format_lf_full$PTM %>% head(5)
    ProteinName              PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
337 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L
338 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L
339 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L
340 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L
341 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L
               Condition         BioReplicate                  Run Fraction Intensity
337     SN1_proteomeJN19     SN1_proteomeJN19     SN1_proteomeJN19        1        NA
338 SN1_proteomeJN19_rep SN1_proteomeJN19_rep SN1_proteomeJN19_rep        1        NA
339      SN1_proteomeJN4      SN1_proteomeJN4      SN1_proteomeJN4        1        NA
340  SN1_proteomeJN4_rep  SN1_proteomeJN4_rep  SN1_proteomeJN4_rep        1        NA
341     SN2_proteomeJN19     SN2_proteomeJN19     SN2_proteomeJN19        1        NA

 

To summarize, although there has been significant progress, it seems that some things are still not right. I would love to know what changes I should make to get “msstats_format_lf_full” as it should be so that I can continue with the analysis.

 

Best,

Sofia

Sofia Farkona

Oct 3, 2024, 8:33:41 PM
to MSstats

Continuing from the previous email, I have attached a zip folder containing smaller versions of the files I got from MaxQuant (first 5000 rows). I included evidence5000, created the annotation file using evidence5000, and removed duplicate rows. I also included the Phospho(STY)Sites and proteinGroups files from MaxQuant (first 5000 rows) just in case they are useful.

I would greatly appreciate your feedback.


Best,

S

files_5000rows.zip

Anthony Wu

Oct 11, 2024, 1:05:31 PM
to MSstats
Hi,

Thank you for your detailed response.  I took a look and there seem to be 3 independent issues going on here.

1. "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications."

For this issue, it looks like the FASTA file approach does not handle protein isoforms well (e.g. P55011-3).  I need to investigate this further, but to expedite the process, could you also include the FASTA file you used?  On Uniprot, depending on the settings you use, the FASTA file generated will distinguish between various isoforms, but I need to confirm whether you follow this case or not.

I also have an idea to enable these functions to work without FASTA files in the future.  If you think that'd make these functions much easier to use, let me know and I can look into prioritizing that change.

2. "For some columns I only get NA"

I looked at your table and it looks like the columns "FragmentIon" and "ProductCharge" are NA, which is the case for DDA experiments.  Could you confirm you used DDA? 

3. "there are many NA in the Intensity column"

Looking at your annotation file, could you confirm you did 16 MS runs on the global proteome, then do 8 MS runs involving enrichment?  If so, then the explanation to the NAs & the adjustment needed should make sense below:

One reason there are many NAs in intensity is because there are no detected PTMs in the proteome runs.  For example, in row 338 you shared:

338 P49006_T148 AAAT(Phospho (STY))PESQEPQAK               2          NA            NA                L SN1_proteomeJN19_rep SN1_proteomeJN19_rep SN1_proteomeJN19_rep        1        NA

Intensity is NA for the proteome run corresponding to P49006 with phosphorylation at site 148, which makes sense since there are no PTMs reported in the proteome run.  
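
A quick way to check this pattern is to tabulate the fraction of NA intensities per run. The sketch below uses a tiny invented table (the run name "SN1_enriched" and all intensity values are hypothetical); on the real converter output one would apply the same `tapply` call to the Run and Intensity columns of the PTM table.

```r
# Toy stand-in for the PTM table: one proteome run and one enriched run.
ptm_toy <- data.frame(
  Run = c("SN1_proteomeJN19", "SN1_proteomeJN19", "SN1_enriched", "SN1_enriched"),
  Intensity = c(NA, NA, 2.5e6, NA)
)

# Fraction of missing intensities in each run.
na_fraction_by_run <- tapply(is.na(ptm_toy$Intensity), ptm_toy$Run, mean)
print(na_fraction_by_run)
```

If the proteome runs show a fraction at or near 1 while the enriched runs do not, the NAs are explained by the experimental design rather than by a conversion problem.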

With separate runs on the global proteome and enrichment, you can run MaxQuant's identification and quantification twice, once on the proteome runs and once on the enriched runs.  Then after generating 2 MaxQ reports, one for the proteome and one for the PTM, you can run MSstatsPTM's converter with adjustments to its parameters (highlighted in yellow and bolded below).  In summary, this adjustment will store the global proteome information in msstats_format_lf_full$PROTEIN and the enriched information in msstats_format_lf_full$PTM.  

msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence, # enriched runs
    annotation = annotationclean, # enriched runs annotation
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
    fasta_protein_name = "uniprot_ac",
    mod_id = "\\(Phospho \\(STY\\)\\)",
    use_unmod_peptides = FALSE,  # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
    labeling_type = "LF",
    which_proteinid_ptm = "Proteins",
    evidence_prot = evidence_proteome, # proteome runs
    annotation_protein = annotation_proteome # proteome runs
)


Let me know if you need any more clarifications.


Thanks,

Tony

Message has been deleted

Sofia Farkona

Oct 16, 2024, 5:34:06 PM
to MSstats
Hello Tony thanks for your response. I had to delete my previous email because (unfortunately) I have more updates. I will get into everything in the same way you did:

1. "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications."

 " Could you also include the FASTA file you used?  On Uniprot, depending on the settings you use, the FASTA file generated will distinguish between various isoforms, but I need to confirm whether you follow this case or not."   I shared the previous fasta file I created and used (idmapping_2024_09_17.fasta).

Also, a new update: I tried what you suggested (I think). I uploaded to Retrieve/ID mapping | UniProt the same txt file I created previously using evidence$Leading.razor.protein:

unique_protein_ids <- unique(evidence$Leading.razor.protein)

When the ID mapping was done, I downloaded the results uncompressed, in the "canonical & isoform" format, and ended up with the file idmapping_2024_10_16.fasta, which is about twice the size of the previous FASTA file.
Naturally, I tried to run the conversion as before:
msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence,
    annotation = annotationclean,  # duplicates are removed from the annotation file
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_10_16.fasta",
    fasta_protein_name = "uniprot_ac",
    mod_id = "\\(Phospho \\(STY\\)\\)",
    use_unmod_peptides = TRUE,
    labeling_type = "LF",
    which_proteinid_ptm = "Proteins"  # Use the 'Proteins' column as in the documentation, not the razor proteins
)

Unfortunately, I again ended up with an error pointing to duplicates.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 502226 rows; more than 466240 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

I reran the conversion with the previous FASTA file and did not get this error, only the previous warning: "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications".
Bottom line: it seems clear that the new FASTA file triggers the "Error in vecseq ..." error.

2. "For some columns I only get NA"

 "Could you confirm you used DDA?" I confirm I used DDA.  

3. "there are many NA in the Intensity column"

 "Looking at your annotation file, could you confirm you did 16 MS runs on the global proteome, then do 8 MS runs involving enrichment?" I confirm.  


"If so, then the explanation to the NAs & the adjustment needed should make sense below: One reason there are many NAs in intensity is because there are no detected PTMs in the proteome runs." You are right, I should have seen this. So those NA make sense. But will they create problems with the rest of the analysis?

"Intensity is NA for the proteome run corresponding to P49006 with phosphorylation at site 148, which makes sense since there are no PTMs reported in the proteome run. " It makes sense, you are right. But will they create problems with the rest of the analysis?

 "With separate runs on the global proteome and enrichment, you can run MaxQuant's identification and quantification twice, once on the proteome runs and once on the enriched runs.  Then after generating 2 MaxQ reports, one for the proteome and one for the PTM, you can run MSstatsPTM's converter with adjustments to its parameters (highlighted in yellow and bolded below).  In summary, this adjustment will store the global proteome information in msstats_format_lf_full$PROTEIN and the enriched information in msstats_format_lf_full$PTM.  "

msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence, # enriched runs
    annotation = annotationclean, # enriched runs annotation
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
    fasta_protein_name = "uniprot_ac",
    mod_id = "\\(Phospho \\(STY\\)\\)",
    use_unmod_peptides = FALSE,  # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
    labeling_type = "LF",
    which_proteinid_ptm = "Proteins",
    evidence_prot = evidence_proteome, # proteome runs
    annotation_protein = annotation_proteome # proteome runs
)

I did not understand the last part of your answer, so I am including a few details about my experimental and MaxQuant setup.

We have 4 experimental conditions with 2 biological replicates each. This gave us 8 “enriched” samples, each run once by LC-MS/MS. We also have 8 “unbound” (non-enriched) fractions from the same samples that represent the global proteome. These were run in duplicate by LC-MS/MS (there was enough protein for a double injection). In essence, for each enriched run there are 2 global proteome runs.  Then the raw data (16+8) were run in MaxQuant as suggested by the authors (DOI: 10.1038/nprot.2016.136).
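
For a design like this, the two annotation tables could be laid out roughly as in the sketch below. Everything named here is hypothetical: the condition labels, the raw file names, and the choice to give both proteome injections of a sample the same BioReplicate value (my understanding of the usual MSstats convention for technical replicates, stated as an assumption). In practice Raw.file must match the names in each evidence file.

```r
# Hypothetical condition labels; the thread does not name the four conditions.
conditions <- paste0("Cond", 1:4)
design <- expand.grid(Rep = 1:2, Condition = conditions, stringsAsFactors = FALSE)
design$BioReplicate <- paste0(design$Condition, "_rep", design$Rep)

# Enriched samples: one injection per biological replicate (8 runs).
annotation_enriched <- data.frame(
  Raw.file = paste0(design$BioReplicate, "_phospho"),
  Condition = design$Condition,
  BioReplicate = design$BioReplicate,
  IsotopeLabelType = "L"
)
annotation_enriched$Run <- annotation_enriched$Raw.file

# Global proteome: two injections per biological replicate (16 runs).
design_twice <- design[rep(seq_len(nrow(design)), each = 2), ]
injections <- rep(1:2, times = nrow(design))
annotation_proteome <- data.frame(
  Raw.file = paste0(design_twice$BioReplicate, "_proteome_inj", injections),
  Condition = design_twice$Condition,
  BioReplicate = design_twice$BioReplicate,
  IsotopeLabelType = "L"
)
annotation_proteome$Run <- annotation_proteome$Raw.file

print(c(enriched = nrow(annotation_enriched), proteome = nrow(annotation_proteome)))
```

Each table has exactly one row per raw file, so neither needs de-duplication, and the two injections of a sample are distinguishable by Run while sharing a BioReplicate.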

I attached a Word document showing the MaxQuant setup. Essentially, all raw files are loaded, but the global proteome files are assigned to group 0 and the enriched files to group 1. Also, I set PTMs to TRUE only for the enriched files (group 1). The group-specific parameters are the same for both groups, except that for the enriched files (group 1) we add Phospho(STY) as a variable modification in the modifications tab. A few more details can be seen in the attached Word document.

Based on the fact that 

a) it makes sense to get NA in the proteome runs

b) my experimental and MaxQuant set up, do you still think I should run MaxQuant in the way you described above (and modify the code as above)?

Again, long email but I am trying to make sure I understand what you are suggesting and that we are on the same page.

Best,

Sofia



  
idmapping_2024_09_17.fasta
MaxQuant_set up_attachedOct 15th 2024.docx
idmapping_2024_09_17.fasta

Anthony Wu

Oct 31, 2024, 9:09:06 AM
to MSstats
Hi Sofia, 

Not sure if my previous message went through.  Are you available next week for about an hour to discuss this problem on a zoom call?  I think that would help quickly resolve the issues.

Thanks,
Tony

Sofia Farkona

Nov 1, 2024, 9:51:47 AM
to MSstats
Hello Tony, thanks for getting back to me. I would definitely like to meet and discuss the problem over Zoom; thanks for suggesting this. For the coming week, starting on Nov 4th, I am available any time between Tuesday and Thursday (apart from Thursday at 11am). We are in the same time zone, and the clocks change for both Boston and Toronto on Nov 3rd, so we should be fine.

Is it OK if I include our bioinformatician in the discussion? I want to make sure we make the most of your time, and I would like to have him on board for obvious reasons. Could you recommend a few time slots that could work for you?

A few updates since my last email: I tried to work on this:
"With separate runs on the global proteome and enrichment, you can run MaxQuant's identification and quantification twice, once on the proteome runs and once on the enriched runs.  Then after generating 2 MaxQ reports, one for the proteome and one for the PTM, you can run MSstatsPTM's converter with adjustments to its parameters (highlighted in yellow and bolded below).  In summary, this adjustment will store the global proteome information in msstats_format_lf_full$PROTEIN and the enriched information in msstats_format_lf_full$PTM. "

-I ran the global proteome raw data separately in MaxQuant: All good there I got the report.
-I tried to run the enriched (Phos) raw data separately in MaxQuant multiple times: MaxQuant keeps stopping and is unable to generate the report. I checked the MaxQuant log files and found the following: "Label-free_normalization 11.error", which I assume points to an issue specifically with the label-free normalization in the phospho-enriched run. I will look into it more before our meeting.

Looking forward to the discussion. Please let me know when.

Best,
Sofia

Sofia Farkona

Nov 1, 2024, 2:28:32 PM
to MSstats
For the meeting next week (starting Nov 4th 2024) our time slots are: anytime Wednesday, Thursday after 12, or Friday from 9am-11am.

Looking forward
Best,
S

Anthony Wu

Nov 1, 2024, 7:00:21 PM
to MSstats
Hi,

I emailed you a meeting invite for 11/6 3:30pm

Thanks,
Tony

Sofia Farkona

Nov 4, 2024, 2:19:33 PM
to MSstats
Hey Tony I got it, thank you. See you then.

Best,
Sofia

Sofia Farkona

Jan 7, 2025, 7:21:14 PM
to MSstats

Hello Tony (and everyone). It's been some time, but I finally got to implement the suggestions from our previous meeting. The most important one was to run our phospho-enriched data and our global proteome data SEPARATELY in MaxQuant. From this I was able to get 2 "evidence" files: one from the enriched/phospho runs and one from the proteome runs.
Then I created the FASTA file using the unique leading razor proteins from the enriched/phospho evidence file:
unique_protein_ids <- unique(evidence$Leading.razor.protein)
I uploaded these proteins to Retrieve/ID mapping | UniProt and made sure the "From database" was UniProtKB AC/ID and the "To database" was UniProtKB/Swiss-Prot. After the mapping was done I downloaded the FASTA file (it was smaller than last time, which makes sense since we made it from a smaller evidence file).
I also made two different annotation files, one using the evidence from the enriched/phospho runs and a second one using the evidence from the proteome runs. I made sure to remove duplicate rows from both of them.
I believed I was ready to run the suggested code:

msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence, # enriched runs
    annotation = annotationclean, # enriched runs annotation
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
    fasta_protein_name = "uniprot_ac",
    mod_id = "\\(Phospho \\(STY\\)\\)",
    use_unmod_peptides = FALSE,  # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
    labeling_type = "LF",
    which_proteinid_ptm = "Proteins",
    evidence_prot = evidence_proteome, # proteome runs
    annotation_protein = annotation_proteome # proteome runs
)



Unfortunately, I am getting 2 different concerning messages.
A) This one resembles what I was getting before:
[1] "FASTA file missing 1577 Proteins. These will be removed. This may be due to non-unique identifications."
I am planning to investigate which proteins are missing.

B) I am also getting this error (I don't understand it at all; I did not get it previously, and I checked every single file I am using):
Error in data.table::fread(input, showProgress = FALSE, ...) :
  input= must be a single character string containing a file name, a system command containing at least one space, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or, the input data itself containing at least one \n or \r
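
Since the message says that something handed to `fread` was not a single character string, one hedged way to narrow this down is to inspect each object before calling the converter. The helper below, `describe_input`, is a hypothetical name introduced only for this sketch, and the example list uses invented values; it does not reproduce how MSstatsPTM reads its inputs, it just reports what kind of object each argument is.

```r
# Hypothetical helper: classify an object that is meant to be a table or a file path.
describe_input <- function(x) {
  if (is.data.frame(x)) {
    return("data.frame")
  }
  if (is.null(x)) {
    return("NULL (argument not supplied?)")
  }
  if (is.character(x) && length(x) == 1) {
    if (file.exists(x)) {
      return("existing file path")
    }
    return("path that does not exist")
  }
  paste("unexpected type:", class(x)[1])
}

# Invented stand-ins for the objects one would pass to the converter.
inputs <- list(
  evidence = data.frame(Raw.file = "run1"),
  evidence_prot = NULL,
  fasta = "no_such_file.fasta"
)

status <- vapply(inputs, describe_input, character(1))
print(status)
```

Any entry that is neither a data.frame nor an existing file path would be a candidate for the argument that `fread` is complaining about.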

The evidence files are too big to upload here but I can upload smaller versions, I guess. Any extra advice will be appreciated. Also, I know I am hoping for too much but if there is any possibility we could meet again it would be great and it would depend on your schedule.

Best,
S

Talia Head

Jan 11, 2025, 9:36:10 PM
to MSstats
Hi, I am using PD instead of MaxQuant but am following along because I am having some similar issues.
I'm not sure about your first error, but for your second error it seems like it's looking for an "input" argument, which is what PDtoMSstats uses instead of "evidence".

Try changing 'evidence = ' to 'input = ' and similarly change 'evidence_prot = ' to 'protein_input = '.

Let me know if that works. I could be way off here, and I don't know why that would have been changed to input for MaxQ, but it seems worth a try.

Anthony Wu

Jan 16, 2025, 10:55:13 AM
to MSstats
Hi Sofia,

I'd be happy to meet again. Next week should work for me. Could you give me some available time slots? I'll then confirm a meeting time and Zoom link.

Before we meet, could you also attach your files (a subset of the evidence, the annotation, and the FASTA) so I can take a preliminary look and debug what could be going wrong?

Thanks,
Tony

Sofia Farkona

Jan 16, 2025, 11:44:55 AM
to MSstats
Hello Tony, thank you so much for being willing to meet with us again. Is there any way we could move it to some time between January 27th and January 30th? I would like our bioinformatician to be able to join as well, and she cannot make it earlier. As for time slots, our schedule is pretty open so far, so let us know what works best for you. Before our meeting I will share the information you asked for and the steps I followed to troubleshoot.

Best,
Sofia

Anthony Wu

Jan 16, 2025, 2:58:03 PM
to MSstats
Hi Sofia,

That week works for me.  Does January 30 at 3:30pm EST work for you?  If so, I'll email you the zoom link directly.

Thanks,
Tony

Sofia Farkona

Jan 20, 2025, 3:42:38 PM
to MSstats
Hello Tony, that works for two of the three of us. I need to ask one more person from our team to make sure she can attend.
I will confirm tomorrow.
Thanks a lot.
Best,
Sofia

Sofia Farkona

Jan 23, 2025, 9:46:35 PM
to MSstats
Hey Tony, we were wondering whether you could meet on that day (January 30th) but a bit earlier than you suggested. For example, could we start at 2pm, or at 3pm at the latest? Let us know :).
Best,
Sofia

Anthony Wu

Jan 24, 2025, 12:05:23 PM
to MSstats
Hi, 

I can start at 2:30pm if that works for you. 3pm also works.

Tony

Sofia Farkona

Jan 24, 2025, 2:28:29 PM
to MSstats
Great, 2:30pm works better. Thanks so much. Feel free to send the Zoom invite to my Gmail.


Thanks so much Tony,
Best,
Sofia

Sofia Farkona

Mar 24, 2025, 4:39:10 PM
to MSstats
Hello all, I should give an update on the issues we have been experiencing here.

The two main issues I reported back in January were:
1)
"FASTA file missing 1577 Proteins. These will be removed. This may be due to non-unique identifications."
It seems we were encountering this because the two FASTA files did not match: when we searched our raw data with MaxQuant we used a UniProt FASTA file that included isoforms, but the FASTA file we passed to the conversion function MaxQtoMSstatsPTMFormat() did not include them. Re-running the MaxQuant search with a UniProt FASTA file without isoforms solved most of this problem.
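
For anyone who wants to check for the same mismatch, here is a minimal sketch that flags isoform-style accessions. It assumes the usual UniProt convention in which isoforms carry a numeric suffix such as "-2" after the canonical accession; the helper name is_isoform_id and the toy IDs are made up for illustration.

```r
# Hypothetical helper: TRUE for accessions that end in "-<digits>",
# which is how UniProt writes isoform accessions (e.g. "P12345-2").
is_isoform_id <- function(ids) grepl("-[0-9]+$", ids)

# Toy IDs for illustration only (not taken from our files)
toy_ids <- c("P11111", "P22222-2", "Q99999-10", "P33333")

toy_ids[is_isoform_id(toy_ids)]
# returns "P22222-2" "Q99999-10"
```

If many IDs from the evidence file are flagged while the FASTA given to the conversion has no such entries, the two files were probably built from different UniProt downloads.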


2)
Error in data.table::fread(input, showProgress = FALSE, ...) :
  input= must be a single character string containing a file name, a system command containing at least one space, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or, the input data itself containing at least one \n or \r

We had left out the last argument (proteinGroups) in the conversion call below:

msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
    evidence = evidence,                           # Enriched runs evidence
    annotation = annotationclean,                 # Enriched runs annotation
    fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/CodeFeb19th2025/idmapping_2025_02_20.fasta",  
    fasta_protein_name = "uniprot_ac",            # Protein identifier format
    mod_id = "\\(Phospho \\(STY\\)\\)",           # Modification ID for Phospho(STY)
    use_unmod_peptides = FALSE,                   # Ensure FALSE for separate reports
    labeling_type = "LF",                         # Label-free quantification
    which_proteinid_ptm = "Proteins",             # Use the 'Proteins' column for PTMs
    evidence_prot = evidence_pro,                 # Proteome runs evidence
    annotation_protein = annotation_pro_clean,     # Proteome runs annotation
    proteinGroups = proteinGroups # additional parameter - protein groups file from MaxQ
)

Once we included this, the conversion with the function MaxQtoMSstatsPTMFormat() was successful.
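
A small pre-flight check along these lines might catch a missing or unloaded input before the conversion is run. This is only a sketch in base R: check_inputs is a made-up helper, not part of MSstatsPTM, and it assumes every input is expected to be a non-empty data frame.

```r
# Hypothetical helper: return the names of inputs that are not
# non-empty data frames (for example, objects that were never loaded).
check_inputs <- function(...) {
  inputs <- list(...)
  ok <- vapply(inputs,
               function(x) is.data.frame(x) && nrow(x) > 0,
               logical(1))
  names(inputs)[!ok]
}

# Toy data for illustration only (not taken from our files)
toy_evidence <- data.frame(Proteins = "P11111")

check_inputs(evidence = toy_evidence, proteinGroups = NULL)
# returns "proteinGroups"
```

An empty result means every input passed the check.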

Thanks SO MUCH, Tony, for helping us troubleshoot! I really appreciate it.