Runtime Discrepancy in Comet Searches: Unspecific vs. Pre-digested Database

沈丹青

Dec 15, 2025, 3:12:12 AM
to Comet ms/ms db search support
Dear Jimmy Eng,

I have encountered some questions while using Comet for immunopeptidomics database searching. 

I performed two separate search runs. For the first search, the reference database contained approximately 370,000 sequences (denoted as DB1), each around 50 amino acids in length, and I set the parameter search_enzyme_number = 0 to allow unspecific cleavage. For the second search, I used a reference database derived from DB1, consisting of all possible 8–25 amino acid peptides generated by in silico unspecific cleavage of the sequences in DB1 (totaling around 100 million entries), and set search_enzyme_number = 11 for "no cut." All other parameters were identical between the two runs.

The first search completed in about 4 hours, whereas the second took approximately 30 hours. Based on my understanding, the second search essentially manually pre-digests the database in an unspecific manner and should therefore be comparable to the first search in principle. I am thus curious why there is such a significant difference in runtime. Does Comet internally employ search-space reduction algorithms—such as an “optimized sliding window approach”—to improve efficiency when search_enzyme_number = 0 is set?

If I only have access to a pre-digested peptide database, is it feasible to directly use the “no cut” setting? If so, what modifications could be made to reduce computational time? Alternatively, do you have any other suggestions for improving search efficiency under such conditions?

Thank you for your time and support!

Best regards,
Danqing Shen 

Jimmy Eng

1:10 PM
to Comet ms/ms db search support
Danqing,

I believe the difference in search times that you're observing is primarily due to how Comet is coded: a search function is invoked on each database sequence.  If you made each of your 8-25 amino acid peptides its own separate FASTA sequence entry, that huge number of single-peptide entries requires a correspondingly huge number of separate search calls, each with its own overhead, and that is likely the reason for the slower search time.  Two suggestions to try:
  1. When generating your all possible 8-25 amino acid peptides, append those peptides together separated by the "*" (asterisk) character.  This allows you to generate your own peptides and use the "no cut" enzyme option without each peptide needing to be its own sequence entry.  Each FASTA sequence would look something like "HEIEYLTK*QLDTLRLV*NEQHAKVY*EQLDLTARDLE*LTNQRLVMES*KAAQQKIHGL*TETIE ...".  I haven't tested this myself, so you might want to break each sequence when it hits 100K or 200K in length.  The longer each sequence is, the better (more efficient).  For searches like this, it's also more efficient if your "spectrum_batch_size" is as large as your free memory allows.  Then run a "no cut" search against this FASTA.  This search should be faster than ~30 hours, but I wouldn't be able to guess how much faster it would be.
  2. Use Comet's fragment ion indexing option.  I'm worried about suggesting this given that your database is huge (no-cut, peptides 8-25 in length, starting with 370K sequences).  I'm not even sure if 100GB of free RAM would be enough to generate and hold the fragment ion index in memory.  If you're not stuck using Comet, maybe try MSFragger or Sage for this search.
If you want to wait, I'm going to test option 1 above (using a human database with 100K sequences).  Since the search of all possible 8-25 amino acid peptides as separate sequence entries will take a long time, it will be a day or two or three before I report back a comparison of how much faster the asterisk appended approach is.
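For anyone wanting to try option 1 before the comparison is reported, here is a minimal Python sketch of the asterisk-appended approach. It is untested against Comet and is only an illustration: the function names (`pack_peptides`, `write_fasta`), the `concat_` header prefix, and the 80-character line wrapping are all assumptions, and the 100K length cap is simply the lower of the two values suggested above. It assumes the peptides are already generated (one string per peptide) and does not remove duplicates.

```python
import io
from typing import Iterable, Iterator, TextIO


def pack_peptides(peptides: Iterable[str], max_len: int = 100_000) -> Iterator[str]:
    """Yield '*'-joined sequences, each at most max_len characters long.

    A single peptide longer than max_len is emitted on its own.
    """
    chunk: list[str] = []
    size = 0  # current length of "*".join(chunk)
    for pep in peptides:
        added = len(pep) + (1 if chunk else 0)  # +1 for the '*' separator
        if chunk and size + added > max_len:
            # Adding this peptide would exceed the cap: flush and start over.
            yield "*".join(chunk)
            chunk, size = [], 0
            added = len(pep)
        chunk.append(pep)
        size += added
    if chunk:
        yield "*".join(chunk)


def write_fasta(
    peptides: Iterable[str],
    out: TextIO,
    max_len: int = 100_000,
    line_width: int = 80,
    prefix: str = "concat_",  # hypothetical header prefix, pick your own
) -> int:
    """Write the packed sequences as FASTA entries; return the entry count."""
    n = 0
    for n, seq in enumerate(pack_peptides(peptides, max_len), start=1):
        out.write(f">{prefix}{n}\n")
        # Wrap long sequences; FASTA readers join the lines back together.
        for i in range(0, len(seq), line_width):
            out.write(seq[i:i + line_width] + "\n")
    return n


if __name__ == "__main__":
    demo = ["HEIEYLTK", "QLDTLRLV", "NEQHAKVY"]
    buf = io.StringIO()
    count = write_fasta(demo, buf, max_len=17, line_width=10)
    print(count)
    print(buf.getvalue(), end="")
```

For a real run you would stream the peptides from your existing pre-digested file into `write_fasta` with an output file opened for writing, then search the resulting FASTA with the same "no cut" `search_enzyme_number` setting and as large a `spectrum_batch_size` as memory allows.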