Very low number of identifications Comet+PeptideProphet


Valeriia Vasylieva

Mar 13, 2025, 8:02:07 PM
to spctools-discuss
Hi.
I ran Comet and PeptideProphet on two public datasets, using TDC against the UniProt database. I then calculated q-values in Python from the fval distribution and filtered the data at a 1% threshold, like this:

    # df - PeptideProphet output
    df = df.sort_values(by='fval', ascending=False).reset_index(drop=True)

    # Calculate cumulative counts of targets and decoys
    df['cum_targets'] = (df['database'] == 'T').cumsum()
    df['cum_decoys'] = (df['database'] == 'D').cumsum()

    # Calculate FDR
    df['FDR'] = df['cum_decoys'] / df['cum_targets']

    # q-value: cumulative minimum of the FDR, taken from bottom to top
    df['q-value'] = df['FDR'][::-1].cummin()[::-1]
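For reference, the same target-decoy calculation can be wrapped in a reusable function. This is only a minimal sketch: it assumes the column names `fval` and `database` ('T'/'D') from the snippet above, the function name `add_qvalues` is made up for illustration, and the guard against a zero target count is an added assumption rather than part of the original calculation.

```python
import pandas as pd


def add_qvalues(df, score_col="fval", label_col="database"):
    """Return a copy of df sorted by score with 'FDR' and 'q-value' columns."""
    out = df.sort_values(by=score_col, ascending=False).reset_index(drop=True)

    # Running counts of targets and decoys, from best score to worst.
    cum_targets = (out[label_col] == "T").cumsum()
    cum_decoys = (out[label_col] == "D").cumsum()

    # Assumption: clip the denominator at 1 so that a decoy ranked first
    # gives FDR = 1 instead of a division by zero.
    out["FDR"] = cum_decoys / cum_targets.clip(lower=1)

    # q-value = lowest FDR reachable at this score or any worse score.
    out["q-value"] = out["FDR"][::-1].cummin()[::-1]
    return out


if __name__ == "__main__":
    demo = pd.DataFrame({
        "fval": [5.0, 4.0, 3.0, 2.0, 1.0],
        "database": ["T", "T", "D", "T", "D"],
    })
    print(add_qvalues(demo)[["fval", "database", "FDR", "q-value"]])
```

Filtering at 1% is then `result[result["q-value"] <= 0.01]`.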

Below you can see the proportion of PSMs annotated as targets and decoys that did or did not pass the q-value threshold. One of the datasets (PXD003594) has a very low number of identifications. It also has a wide distribution of decoys (on the graph the raw files are plotted together). Could anyone suggest what could have happened here? I used the default parameters, except that I set the peptide length range to 7-30 aa and the peptide mass range to 500.0-6000.0, and enabled N-terminal methionine clipping.
Thanks!

[Attachments: Capture9.PNG, Capture8.PNG, Capture6.PNG, Capture7.PNG]


Jimmy Eng

Mar 14, 2025, 6:08:09 PM
to spctools...@googlegroups.com
Valeriia,

That PXD003594 experiment has 76 raw files associated with it. Did you analyze a subset of the raw files here, or does your analysis include all 76 runs? Just for a quick test, I downloaded 4 raw files from that experiment and searched them with Comet against the UniProt human database. Here's a very basic summary:
  • b1369p080_sample_01_a.raw: high-res MS/MS, almost no IDs (fewer than 100 positive PSMs)
  • b1369p601_DMSO_G1_B2_S10.RAW: ion trap MS/MS, ~6000 PSMs at 1% error rate
  • b1369p601_GDC_G2_B3_S23.RAW: ion trap MS/MS, ~6000 PSMs at 1% error rate
  • b1369p65_PP4_R.RAW: ion trap MS/MS, ~500 PSMs at 1% error rate
So there are runs with thousands of good PSM IDs in them. Note that the 4 raw files I sampled contain a mix of high-res and low-res MS/MS spectra, so hopefully you adjusted the fragment ion settings appropriately for each raw file, using my suggested parameter settings shown below. If the fragment ion settings aren't the issue, feel free to follow up and attach the contents of your comet.params file.

high-res:
   fragment_bin_tol = 0.02 
   fragment_bin_offset = 0.0
   theoretical_fragment_ions = 0


low-res: 
   fragment_bin_tol = 1.0005
   fragment_bin_offset = 0.4
   theoretical_fragment_ions = 1


Jimmy


Valeriia Vasylieva

Mar 19, 2025, 11:48:15 PM
to spctools-discuss
Hi Jimmy,

Thank you for your response!

I worked with samples like b1369p080_sample_01_a. There are 14 of them, each divided into fractions _a and _b. I merged the fractions and ran the search on 14 raw files.
I did not take into account that these are high-res MS/MS. Thank you for pointing that out and for the suggested parameters.
The fact that you also identified only a few high-scoring PSMs suggests that my input parameters may not have been far from yours. I attach my comet.params file and would appreciate your comments on it.

Valeriia

# comet_version 2021.01 rev. 0
# Comet MS/MS search engine parameters file.
# Everything following the '#' symbol is treated as a comment.

database_name =
decoy_search = 1                       # 0=no (default), 1=concatenated search, 2=separate search
peff_format = 0                        # 0=no (normal fasta, default), 1=PEFF PSI-MOD, 2=PEFF Unimod
peff_obo =                             # path to PSI Mod or Unimod OBO file

num_threads = 0                        # 0=poll CPU to set num threads; else specify num threads directly (max 128)

#
# masses
#
peptide_mass_tolerance = 20.00
peptide_mass_units = 2                 # 0=amu, 1=mmu, 2=ppm
mass_type_parent = 1                   # 0=average masses, 1=monoisotopic masses
mass_type_fragment = 1                 # 0=average masses, 1=monoisotopic masses
precursor_tolerance_type = 1           # 0=MH+ (default), 1=precursor m/z; only valid for amu/mmu tolerances
isotope_error = 1                      # 0=off, 1=0/1 (C13 error), 2=0/1/2, 3=0/1/2/3, 4=-8/-4/0/4/8 (for +4/+8 labeling)

#
# search enzyme
#
search_enzyme_number = 1               # choose from list at end of this params file
search_enzyme2_number = 0              # second enzyme; set to 0 if no second enzyme
num_enzyme_termini = 2                 # 1 (semi-digested), 2 (fully digested, default), 8 C-term unspecific , 9 N-term unspecific
allowed_missed_cleavage = 2            # maximum value is 5; for enzyme search

#
# Up to 9 variable modifications are supported
# format:  <mass> <residues> <0=variable/else binary> <max_mods_per_peptide> <term_distance> <n/c-term> <required> <neutral_loss>
#     e.g. 79.966331 STY 0 3 -1 0 0 97.976896
#
variable_mod01 = 15.9949 M 0 3 -1 0 0 0.0
variable_mod02 = 0.0 X 0 3 -1 0 0 0.0
variable_mod03 = 0.0 X 0 3 -1 0 0 0.0
variable_mod04 = 0.0 X 0 3 -1 0 0 0.0
variable_mod05 = 0.0 X 0 3 -1 0 0 0.0
variable_mod06 = 0.0 X 0 3 -1 0 0 0.0
variable_mod07 = 0.0 X 0 3 -1 0 0 0.0
variable_mod08 = 0.0 X 0 3 -1 0 0 0.0
variable_mod09 = 0.0 X 0 3 -1 0 0 0.0
max_variable_mods_in_peptide = 5
require_variable_mod = 0

#
# fragment ions
#
# ion trap ms/ms:  1.0005 tolerance, 0.4 offset (mono masses), theoretical_fragment_ions = 1
# high res ms/ms:    0.02 tolerance, 0.0 offset (mono masses), theoretical_fragment_ions = 0, spectrum_batch_size = 15000
#

fragment_bin_tol = 1.0005              # binning to use on fragment ions
fragment_bin_offset = 0.0              # offset position to start the binning (0.0 to 1.0)
theoretical_fragment_ions = 0          # 0=use flanking peaks, 1=M peak only
use_A_ions = 0
use_B_ions = 1
use_C_ions = 0
use_X_ions = 0
use_Y_ions = 1
use_Z_ions = 0
use_Z1_ions = 0
use_NL_ions = 1                        # 0=no, 1=yes to consider NH3/H2O neutral loss peaks

#
# output
#
output_sqtfile = 0                     # 0=no, 1=yes  write sqt file
output_txtfile = 0                     # 0=no, 1=yes  write tab-delimited txt file
output_pepxmlfile = 1                  # 0=no, 1=yes  write pepXML file
output_mzidentmlfile = 0               # 0=no, 1=yes  write mzIdentML file
output_percolatorfile = 0              # 0=no, 1=yes  write Percolator pin file
print_expect_score = 1                 # 0=no, 1=yes to replace Sp with expect in out & sqt
num_output_lines = 5                   # num peptide results to show

sample_enzyme_number = 1               # Sample enzyme which is possibly different than the one applied to the search.
                                       # Used to calculate NTT & NMC in pepXML output (default=1 for trypsin).

#
# mzXML parameters
#
scan_range = 0 0                       # start and end scan range to search; either entry can be set independently
precursor_charge = 0 0                 # precursor charge range to analyze; does not override any existing charge; 0 as 1st entry ignores parameter
override_charge = 0                    # 0=no, 1=override precursor charge states, 2=ignore precursor charges outside precursor_charge range, 3=see online
ms_level = 2                           # MS level to analyze, valid are levels 2 (default) or 3
activation_method = ALL               # activation method; used if activation method set; allowed ALL, CID, ECD, ETD, ETD+SA, PQD, HCD, IRMPD, SID

#
# misc parameters
#
digest_mass_range = 500.0 6000.0       # MH+ peptide mass range to analyze
peptide_length_range = 7 30            # minimum and maximum peptide length to analyze (default 1 63; max length 63)
num_results = 50                      # number of search hits to store internally
max_duplicate_proteins = 20            # maximum number of additional duplicate protein names to report for each peptide ID; -1 reports all duplicates
max_fragment_charge = 3                # set maximum fragment charge state to analyze (allowed max 5)
max_precursor_charge = 6               # set maximum precursor charge state to analyze (allowed max 9)
nucleotide_reading_frame = 0           # 0=proteinDB, 1-6, 7=forward three, 8=reverse three, 9=all six
clip_nterm_methionine = 1              # 0=leave sequences as-is; 1=also consider sequence w/o N-term methionine
spectrum_batch_size = 30000            # max. # of spectra to search at a time; 0 to search the entire scan range in one loop
decoy_prefix = DECOY_                  # decoy entries are denoted by this string which is pre-pended to each protein accession
equal_I_and_L = 1                      # 0=treat I and L as different; 1=treat I and L as same
output_suffix =                        # add a suffix to output base names i.e. suffix "-C" generates base-C.pep.xml from base.mzXML input
mass_offsets =                         # one or more mass offsets to search (values substracted from deconvoluted precursor mass)
precursor_NL_ions =                    # one or more precursor neutral loss masses, will be added to xcorr analysis

#
# spectral processing
#
minimum_peaks = 10                     # required minimum number of peaks in spectrum to search (default 10)
minimum_intensity = 0                  # minimum intensity value to read in
remove_precursor_peak = 0              # 0=no, 1=yes, 2=all charge reduced precursor peaks (for ETD), 3=phosphate neutral loss peaks
remove_precursor_tolerance = 1.5       # +- Da tolerance for precursor removal
clear_mz_range = 0.0 0.0               # for iTRAQ/TMT type data; will clear out all peaks in the specified m/z range

#
# additional modifications
#

add_Cterm_peptide = 0.0
add_Nterm_peptide = 0.0
add_Cterm_protein = 0.0
add_Nterm_protein = 0.0

add_G_glycine = 0.0000                 # added to G - avg.  57.0513, mono.  57.02146
add_A_alanine = 0.0000                 # added to A - avg.  71.0779, mono.  71.03711
add_S_serine = 0.0000                  # added to S - avg.  87.0773, mono.  87.03203
add_P_proline = 0.0000                 # added to P - avg.  97.1152, mono.  97.05276
add_V_valine = 0.0000                  # added to V - avg.  99.1311, mono.  99.06841
add_T_threonine = 0.0000               # added to T - avg. 101.1038, mono. 101.04768
add_C_cysteine = 57.021464             # added to C - avg. 103.1429, mono. 103.00918
add_L_leucine = 0.0000                 # added to L - avg. 113.1576, mono. 113.08406
add_I_isoleucine = 0.0000              # added to I - avg. 113.1576, mono. 113.08406
add_N_asparagine = 0.0000              # added to N - avg. 114.1026, mono. 114.04293
add_D_aspartic_acid = 0.0000           # added to D - avg. 115.0874, mono. 115.02694
add_Q_glutamine = 0.0000               # added to Q - avg. 128.1292, mono. 128.05858
add_K_lysine = 0.0000                  # added to K - avg. 128.1723, mono. 128.09496
add_E_glutamic_acid = 0.0000           # added to E - avg. 129.1140, mono. 129.04259
add_M_methionine = 0.0000              # added to M - avg. 131.1961, mono. 131.04048
add_H_histidine = 0.0000               # added to H - avg. 137.1393, mono. 137.05891
add_F_phenylalanine = 0.0000           # added to F - avg. 147.1739, mono. 147.06841
add_U_selenocysteine = 0.0000          # added to U - avg. 150.0379, mono. 150.95363
add_R_arginine = 0.0000                # added to R - avg. 156.1857, mono. 156.10111
add_Y_tyrosine = 0.0000                # added to Y - avg. 163.0633, mono. 163.06333
add_W_tryptophan = 0.0000              # added to W - avg. 186.0793, mono. 186.07931
add_O_pyrrolysine = 0.0000             # added to O - avg. 237.2982, mono  237.14773
add_B_user_amino_acid = 0.0000         # added to B - avg.   0.0000, mono.   0.00000
add_J_user_amino_acid = 0.0000         # added to J - avg.   0.0000, mono.   0.00000
add_X_user_amino_acid = 0.0000         # added to X - avg.   0.0000, mono.   0.00000
add_Z_user_amino_acid = 0.0000         # added to Z - avg.   0.0000, mono.   0.00000

#
# COMET_ENZYME_INFO _must_ be at the end of this parameters file
#
[COMET_ENZYME_INFO]
0.  Cut_everywhere         0      -           -
1.  Trypsin                1      KR          P
2.  Trypsin/P              1      KR          -
3.  Lys_C                  1      K           P
4.  Lys_N                  0      K           -
5.  Arg_C                  1      R           P
6.  Asp_N                  0      D           -
7.  CNBr                   1      M           -
8.  Glu_C                  1      DE          P
9.  PepsinA                1      FL          P
10. Chymotrypsin           1      FWYL        P
11. No_cut                 1      @           @



On Saturday, March 15, 2025 at 00:08:09 UTC+2, Jimmy Eng wrote:

Jimmy Eng

Mar 20, 2025, 7:24:45 PM
to spctools-discuss
Your search parameters look fine, except that you are using low-res MS/MS settings; you need to change those to the high-res settings.
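A quick way to check a comet.params file against the high-res settings is to parse its `key = value` lines and compare the three fragment ion entries. This is only a sketch: the helper names `parse_comet_params` and `fragment_mismatches` are hypothetical, and the parser assumes the simple `key = value  # comment` layout seen in the posted file.

```python
# High-res values taken from the suggestion earlier in this thread.
HIGH_RES = {
    "fragment_bin_tol": "0.02",
    "fragment_bin_offset": "0.0",
    "theoretical_fragment_ions": "0",
}


def parse_comet_params(text):
    """Parse 'key = value  # comment' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        # The enzyme table at the end of the file is not key = value data.
        if line.strip().startswith("[COMET_ENZYME_INFO]"):
            break
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def fragment_mismatches(params, recommended=HIGH_RES):
    """Return {key: (found, recommended)} for settings that differ."""
    diffs = {}
    for key, want in recommended.items():
        have = params.get(key)
        # A missing or empty value counts as a mismatch.
        if not have or float(have) != float(want):
            diffs[key] = (have, want)
    return diffs


if __name__ == "__main__":
    posted = (
        "fragment_bin_tol = 1.0005              # binning to use on fragment ions\n"
        "fragment_bin_offset = 0.0              # offset position to start the binning\n"
        "theoretical_fragment_ions = 0          # 0=use flanking peaks, 1=M peak only\n"
    )
    print(fragment_mismatches(parse_comet_params(posted)))
```

With the three fragment ion lines from the posted file, only `fragment_bin_tol` (1.0005 versus 0.02) is reported as differing from the high-res values.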

I just performed an analysis on all 14 of the "b1369p080_sample_*_a.raw" files. The only difference is that I'm using the latest 2025.01.1 version of Comet, although I'm moderately confident that I'd see the same results with the 2021 release of Comet. The plot below is a q-value plot based on sorting the raw Comet results by either xcorr or expectation value. There are ~3500 or ~4500 PSMs at a 1% FDR depending on which score is used, which isn't a whole lot for 14 runs. If you had extended your maximum peptide length from 30 to 50, I believe about 10% more IDs would have been picked up. These numbers are in the same ballpark as your FDR analysis, where the numbers of target and decoy PSMs were nearly equal, which means I don't think you did anything wrong.
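The count of PSMs at a 1% FDR for each score can be sketched as below. This is an illustration under stated assumptions, not the code used for the plot: the function name `count_targets_at_fdr` is hypothetical, decoys are assumed to be flagged in a boolean array, and the FDR estimate is the simple decoys/targets ratio. The one detail that differs between the two scores is the sort direction, since a higher xcorr is better while a lower expectation value is better.

```python
import numpy as np


def count_targets_at_fdr(scores, is_decoy, higher_is_better=True, threshold=0.01):
    """Count target PSMs whose q-value is at or below `threshold`."""
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)

    # Rank PSMs from best to worst for the chosen score.
    order = np.argsort(-scores if higher_is_better else scores, kind="stable")
    decoy_sorted = is_decoy[order]

    decoys = np.cumsum(decoy_sorted)
    targets = np.cumsum(~decoy_sorted)
    fdr = decoys / np.maximum(targets, 1)  # guard against zero targets

    # q-value: running minimum of the FDR taken from the worst score upward.
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    return int(np.sum((qvals <= threshold) & ~decoy_sorted))


if __name__ == "__main__":
    decoy = [False, False, False, True, False]
    xcorr = [4.0, 3.5, 3.0, 2.0, 1.0]
    expect = [1e-6, 1e-5, 1e-4, 0.5, 1.0]
    print(count_targets_at_fdr(xcorr, decoy, higher_is_better=True))
    print(count_targets_at_fdr(expect, decoy, higher_is_better=False))
```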

xcorr.png


As a public service announcement, consider updating your Comet from the 2021 release (as implied by the version listed in the comet.params file). Assuming you're using Comet with the TPP on a Windows box, you can do so by simply replacing "comet.exe" in C:\TPP\bin\ with the "comet.win64.exe" that you download from here. Rename your current binary to something like "comet.2021.exe", so that you can revert to it if needed, and rename the downloaded binary to "comet.exe". You can then grab an example comet.params for v2025 at the top of this page, as the file has changed slightly.
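The backup-and-replace step can also be scripted. The following is a minimal sketch, assuming the file names mentioned above; the function name `swap_comet_binary` is made up, and the download location in the usage comment is a placeholder to adjust.

```python
import shutil
from pathlib import Path


def swap_comet_binary(bin_dir, downloaded, backup_name="comet.2021.exe"):
    """Back up bin_dir/comet.exe and install `downloaded` in its place."""
    bin_dir = Path(bin_dir)
    current = bin_dir / "comet.exe"
    backup = bin_dir / backup_name

    # Refuse to overwrite an earlier backup.
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")

    current.rename(backup)             # keep the old binary for reverting
    shutil.copy2(downloaded, current)  # new binary takes the comet.exe name
    return backup


# Example call (paths are assumptions; run with sufficient permissions):
# swap_comet_binary(r"C:\TPP\bin", r"C:\Users\me\Downloads\comet.win64.exe")
```

To revert, delete the new "comet.exe" and rename the backup to "comet.exe".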

