Hello all. My colleague and I are currently in the process of analyzing our phosphoproteomics LC-MS/MS-based data. I have been trying and miserably failing to convert these label-free MaxQuant data into MSstatsPTM format. I tried to follow the instructions from the MaxQtoMSstatsPTMFormat R documentation (it can also be seen here new.bioconductor.org/packages/release/bioc/vignettes/MSstatsPTM/inst/doc/MSstatsPTM_LabelFree_Workflow.R and here MSstatsPTM: Statistical Characterization of Post-translational Modifications (bioconductor.org). )
I have the "evidence.txt" file from MaxQuant, I made the annotation file as explained, I have the protein groups.txt file and the Phospho(STY)Sites.txt (although the last one not required), but I have been failing and failing. (I followed the code from the example below).
code from MaxQtoMSstatsPTMFormat R documentation:
msstats_format_lf = MaxQtoMSstatsPTMFormat(evidence=maxq_lf_evidence,
annotation=maxq_lf_annotation,
fasta=system.file("extdata",
"maxq_lf_fasta.fasta",
package="MSstatsPTM"),
fasta_protein_name="uniprot_ac",
mod_id="\\(Phospho \\(STY\\)\\)",
use_unmod_peptides=TRUE,
labeling_type = "LF",
which_proteinid_ptm = "Proteins")
When I tried to run, I got the following error in the console: "Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 846403 rows; more than 377678 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."From my understanding the error is somehow relevant to duplicate protein IDs in the evidence file? But these duplicated protein IDs are there because of the way data are reported in Evidence file. For example, a peptide sequence (reported in Sequence column) will be reported multiple times if it is identified in the different experiments (column experiment) or in the different raw file (column Raw.file). In this way the respective protein ID will also be reported multiple times? Am I wrong? I have run MaxQuant multiple time, this is how evidence file would always look.
I tried to run with allow cartesian=TRUE; did not work either.
Any help will be appreciated
Best,
S
Hello Tony, I am back with updates. I had some progress during the last days after, as you suggested, I had removed the duplicate rows from the annotation file.
I did this firstly using the smaller versions of my files (first 100 rows from evidence and annotation file, annotation file always cleaned from duplicate rows). I gradually increased the size of the files and finally I ran the conversion for the full files. Below you can see how this went:
> msstats_format_lf_full =
MaxQtoMSstatsPTMFormat(
+ evidence = evidence,
+ annotation = annotationclean,
+ fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
+ fasta_protein_name = "uniprot_ac",
+ mod_id = "\\(Phospho \\(STY\\)\\)",
+ use_unmod_peptides = TRUE,
+ labeling_type = "LF",
+ which_proteinid_ptm = "Proteins" # Use the 'Proteins' column as in the documentation not the razor proteins
+ )
Question 1
I did not get any obvious errors in the console, but during the conversion I saw this message in yellow:
“[1] "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications."
This is concerning and so I would like to make sure that I made the FASTA file correctly. I made the FASTA file using “leading.razor.protein” variable (column) from evidence file: unique_protein_ids <- unique(evidence$Leading razor protein). Maybe this was NOT correct? Maybe I should have made it using evidence$Protein What do you think?
……………………..
Question 2
About the resulted my msstats_format_lf_full$PROTEIN and msstats_format_lf_full$PTM data frames: I get many observations (rows) in both of them:
> nrow(msstats_format_lf_full$PROTEIN)
[1] 280944
> nrow(msstats_format_lf_full$PTM)
[1] 20112
Each data frame has 11 variables or columns, but in some of these columns I get lots of NA. From msstats_format_lf_full$PROTEIN for some columns I only get NA (columns: PrecursorCharge, Fragmentation). Also, there are many NA in the Intensity column (could that be because “FASTA file misses 5225 Proteins as it was mentioned earlier?
Here is an example for msstats_format_lf_full$PROTEIN:
> msstats_format_lf_full$PROTEIN %>% head(5)
ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition
1 P36578 AAAAAAALQAK 2 NA NA L SN1_proteomeJN19
2 P36578 AAAAAAALQAK 2 NA NA L SN1_proteomeJN19_rep
3 P36578 AAAAAAALQAK 2 NA NA L SN1_proteomeJN4
4 P36578 AAAAAAALQAK 2 NA NA L SN1_proteomeJN4_rep
5 P36578 AAAAAAALQAK 2 NA NA L SN2_proteomeJN19
BioReplicate Run Fraction Intensity
1 SN1_proteomeJN19 SN1_proteomeJN19 1 1124000000
2 SN1_proteomeJN19_rep SN1_proteomeJN19_rep 1 838480000
3 SN1_proteomeJN4 SN1_proteomeJN4 1 955310000
4 SN1_proteomeJN4_rep SN1_proteomeJN4_rep 1 642690000
5 SN2_proteomeJN19 SN2_proteomeJN19 1 572790000
About msstats_format_lf_full$PTM I also get many NA, in Intensity column as well
> msstats_format_lf_full$PTM %>% head(5)
ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
337 P49006_T148 AAAT(Phospho (STY))PESQEPQAK 2 NA NA L
338 P49006_T148 AAAT(Phospho (STY))PESQEPQAK 2 NA NA L
339 P49006_T148 AAAT(Phospho (STY))PESQEPQAK 2 NA NA L
340 P49006_T148 AAAT(Phospho (STY))PESQEPQAK 2 NA NA L
341 P49006_T148 AAAT(Phospho (STY))PESQEPQAK 2 NA NA L
Condition BioReplicate Run Fraction Intensity
337 SN1_proteomeJN19 SN1_proteomeJN19 SN1_proteomeJN19 1 NA
338 SN1_proteomeJN19_rep SN1_proteomeJN19_rep SN1_proteomeJN19_rep 1 NA
339 SN1_proteomeJN4 SN1_proteomeJN4 SN1_proteomeJN4 1 NA
340 SN1_proteomeJN4_rep SN1_proteomeJN4_rep SN1_proteomeJN4_rep 1 NA
341 SN2_proteomeJN19 SN2_proteomeJN19 SN2_proteomeJN19 1 NA
To summarize, although there was a significant progress, it seems that some things are still not optimized. I would love to know what changes I should try to get the “msstats_format_lf_full” as it should be and continue with the analysis.
Best,
Sofia
Continuing from the previous email, I have attached here a zip folder containing the smaller versions of the files I got from MaxQuant (first 5000 rows). So I included evidence5000, created the annotation file using evidence5000 and removed duplicate rows. I also included phospho_STY_Sites and proteinsGroups from MaxQuant (first 5000 rows) just in case they are useful.
I will appreciate a lot your feedback.
Best,
S
msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
evidence = evidence, # enriched runs
annotation = annotationclean, # enriched runs annotation
fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
fasta_protein_name = "uniprot_ac",
mod_id = "\\(Phospho \\(STY\\)\\)",
use_unmod_peptides = FALSE, # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
labeling_type = "LF",
which_proteinid_ptm = "Proteins",
evidence_prot = evidence_proteome, # proteome runs
annotation_protein = annotation_proteome # proteome runs
)
Let me know if you need any more clarifications.
Thanks,
Tony
1. "FASTA file missing 5225 Proteins. These will be removed. This may be due to non-unique identifications."
" Could you also include the FASTA file you used? On Uniprot, depending on the settings you use, the FASTA file generated will distinguish between various isoforms, but I need to confirm whether you follow this case or not." I shared the previous fasta file I created and used (idmapping_2024_09_17.fasta).
2. "For some columns I only get NA"
"Could you confirm you used DDA?" I confirm I used DDA.
3. "there are many NA in the Intensity column"
"Looking at your annotation file, could you confirm you did 16 MS runs on the global proteome, then do 8 MS runs involving enrichment?" I confirm.
"Intensity is NA for the proteome run corresponding to P49006 with phosphorylation at site 148, which makes sense since there are no PTMs reported in the proteome run. " It makes sense, you are right. But will they create problems with the rest of the analysis?
"With separate runs on the global proteome and enrichment, you can run MaxQuant's identification and quantification twice, once on the proteome runs and once on the enriched runs. Then after generating 2 MaxQ reports, one for the proteome and one for the PTM, you can run MSstatsPTM's converter with adjustments to its parameters (highlighted in yellow and bolded below). In summary, this adjustment will store the global proteome information in msstats_format_lf_full$PROTEIN and the enriched information in msstats_format_lf_full$PTM. "
msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
evidence = evidence, # enriched runs
annotation = annotationclean, # enriched runs annotation
fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
fasta_protein_name = "uniprot_ac",
mod_id = "\\(Phospho \\(STY\\)\\)",
use_unmod_peptides = FALSE, # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
labeling_type = "LF",
which_proteinid_ptm = "Proteins",
evidence_prot = evidence_proteome, # proteome runs
annotation_protein = annotation_proteome # proteome runs
)
I did not understand the last part of your answer and so I chose to include a few details for my experimental and MaxQuant set up.
We have 4 experimental conditions, 2 biological replicates for each one. In this way we ended up with 8 “enriched” samples which were run in singletons w LC-MS/MS. We also have 8 “unbound” (or non-enriched) fractions from the same samples that represent the global proteome. These were run in duplicates w LC-MS/MS (in duplicates because there was enough protein for double injection). In essence for each enriched run there are 2 global proteome runs. Then the raw data (16+8) were run in MaxQuant as suggested by the authors (DOI: 10.1038/nprot.2016.136).
I attached a word document showing the MaxQuant set up: Essentially all raw files are uploaded but the "global proteomes ones" are specified as group 0 and the "enriched" one as specified as group 1. Also, here I set PTMs as TRUE only for the enriched ones (group 1). Then for group specific parameters all is the same for both groups except for the enriched ones (group 1), in the modifications tab, we add Phosho(STY) as variable modifications. A few more details can be seen in the attached word document.
Based on the fact that
a) it makes sense to get NA in the proteome runs
b) my experimental and MaxQuant set up, do you still think I should run MaxQuant in the way you described above (and modify the code as above)?
Again, long email but I am trying to make sure I understand what you are suggesting and that we are on the same page.
Best,
Sofia
msstats_format_lf_full = MaxQtoMSstatsPTMFormat(
evidence = evidence, # enriched runs
annotation = annotationclean, # enriched runs annotation
fasta = "E:/phospho_all_Zahraas and mine/day2_TMT_PTM_my attempt_Asus_office/Code/idmapping_2024_09_17.fasta",
fasta_protein_name = "uniprot_ac",
mod_id = "\\(Phospho \\(STY\\)\\)",
use_unmod_peptides = FALSE, # This needs to be changed to FALSE when you have two reports, one for proteome and one for enriched runs
labeling_type = "LF",
which_proteinid_ptm = "Proteins",
evidence_prot = evidence_proteome, # proteome runs
annotation_protein = annotation_proteome # proteome runs