Runtime Discrepancy in Comet Searches: Unspecific vs. Pre-digested Database

Danqing Shen (沈丹青)

Dec 15, 2025, 3:12:12 AM

to Comet ms/ms db search support

Dear Jimmy Eng,

I have some questions about using Comet for immunopeptidomics database searching.

I performed two separate search runs. For the first search, the reference database contained approximately 370,000 sequences (denoted as DB1), each around 50 amino acids in length, and I set the parameter search_enzyme_number = 0 to allow unspecific cleavage. For the second search, I used a reference database derived from DB1, consisting of all possible 8–25 amino acid peptides generated by in silico unspecific cleavage of the sequences in DB1 (totaling around 100 million entries), and set search_enzyme_number = 11 for "no cut." All other parameters were identical between the two runs.
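The pre-digestion described above can be sketched as a sliding-window enumeration. This is only an illustration, assuming "unspecific cleavage" means every contiguous substring of 8 to 25 residues; the function name `unspecific_peptides` is hypothetical and not part of Comet.

```python
from typing import Iterator


def unspecific_peptides(sequence: str, min_len: int = 8, max_len: int = 25) -> Iterator[str]:
    """Yield every contiguous substring of `sequence` whose length is
    between `min_len` and `max_len` (inclusive)."""
    n = len(sequence)
    # A peptide can only start where at least `min_len` residues remain.
    for start in range(n - min_len + 1):
        longest = min(max_len, n - start)
        for length in range(min_len, longest + 1):
            yield sequence[start:start + length]


if __name__ == "__main__":
    # A 50-residue sequence yields 621 windows of length 8-25.
    print(sum(1 for _ in unspecific_peptides("A" * 50)))
```

The total for a whole database depends on the actual sequence lengths and on whether duplicate peptides are removed.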

The first search completed in about 4 hours, whereas the second took approximately 30 hours. Based on my understanding, the second search essentially manually pre-digests the database in an unspecific manner and should therefore be comparable to the first search in principle. I am thus curious why there is such a significant difference in runtime. Does Comet internally employ search-space reduction algorithms—such as an “optimized sliding window approach”—to improve efficiency when search_enzyme_number = 0 is set?

If I only have access to a pre-digested peptide database, is it feasible to directly use the “no cut” setting? If so, what modifications could be made to reduce computational time? Alternatively, do you have any other suggestions for improving search efficiency under such conditions?

Thank you for your time and support!

Best regards,
Danqing Shen

Jimmy Eng

Dec 16, 2025, 1:10:32 PM

to Comet ms/ms db search support

Danqing,

I believe the difference in search times that you're observing is primarily due to how Comet is coded: a search function is invoked on each database sequence. If you made each of your 8-25 amino acid peptides its own separate FASTA sequence entry, that huge number of individual entries (each composed of a single peptide) requires a huge number of separate search calls, each with its own overhead, which is likely the reason for the slower search time. Two suggestions to try:

1. When generating all possible 8-25 amino acid peptides, append those peptides together, separated by the "*" (asterisk) character. This allows you to generate your own peptides and use the "no cut" enzyme option without each peptide needing to be its own sequence entry. Each FASTA sequence would look something like "HEIEYLTK*QLDTLRLV*NEQHAKVY*EQLDLTARDLE*LTNQRLVMES*KAAQQKIHGL*TETIE ...". I haven't tested this myself, so you might want to break each sequence when it hits 100K or 200K in length. The longer each sequence is, the better (more efficient). For searches like this, it's also more efficient if your "spectrum_batch_size" is as large as your free memory allows. Then run a "no cut" search against this FASTA. This search should be faster than ~30 hours, but I can't guess by how much.
2. Use Comet's fragment ion indexing option. I'm worried about suggesting this given that your database is huge (no-cut, peptides 8-25 in length, starting with 370K sequences). I'm not even sure if 100GB of free RAM would be enough to generate and hold the fragment ion index in memory. If you're not stuck using Comet, maybe try MSFragger or Sage for this search.
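The asterisk-concatenation idea in the first suggestion could be sketched roughly as follows. This is a minimal illustration, not a tested Comet workflow: the function names, the `packed_` header prefix, and the 80-column line wrapping are all assumptions made here, and the ~200K length cap simply follows the figure mentioned above.

```python
import io
from typing import Iterable, Iterator, TextIO


def pack_peptides(peptides: Iterable[str], max_len: int = 200_000, sep: str = "*") -> Iterator[str]:
    """Join peptides with `sep`, starting a new string whenever adding the
    next peptide would push the current one past `max_len` characters."""
    chunk: list[str] = []
    size = 0
    for pep in peptides:
        added = len(pep) + (len(sep) if chunk else 0)
        if chunk and size + added > max_len:
            yield sep.join(chunk)
            chunk, size = [], 0
            added = len(pep)
        chunk.append(pep)
        size += added
    if chunk:
        yield sep.join(chunk)


def write_packed_fasta(peptides: Iterable[str], out: TextIO, max_len: int = 200_000,
                       line_width: int = 80, prefix: str = "packed_") -> int:
    """Write one FASTA entry per packed string; return the number of entries.
    The header prefix and line wrapping are illustrative choices."""
    n = 0
    for n, seq in enumerate(pack_peptides(peptides, max_len), start=1):
        out.write(f">{prefix}{n}\n")
        for i in range(0, len(seq), line_width):
            out.write(seq[i:i + line_width] + "\n")
    return n


if __name__ == "__main__":
    demo = io.StringIO()
    write_packed_fasta(["HEIEYLTK", "QLDTLRLV", "NEQHAKVY"], demo, max_len=17)
    print(demo.getvalue(), end="")
```

Because the peptides are streamed through a generator, a 100-million-peptide list never has to be held in memory at once. One trade-off of packing is that the original per-peptide headers are lost, so any mapping back to source proteins would have to be kept separately.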

If you want to wait, I'm going to test option 1 above (using a human database with 100K sequences). Since the search of all possible 8-25 amino acid peptides as separate sequence entries will take a long time, it will be a day or two or three before I report back a comparison of how much faster the asterisk-appended approach is.

Danqing Shen

Dec 17, 2025, 7:56:11 AM

to Comet ms/ms db search support

Dear Jimmy Eng,

Thank you for the helpful explanation and suggestions.

We'd be very interested in seeing the results of Option 1 when you complete your tests, as this could greatly improve our workflow efficiency.

Please share your findings when convenient. We appreciate your support.

Best regards,
Danqing

Jimmy Eng

Dec 17, 2025, 12:01:31 PM

to Comet ms/ms db search support

I wasn't patient enough to wait tens of hours to run tests on a large database, so I tested with 20K human sequences and a spectral file containing 5K ms/ms spectra. The direct no-enzyme search against the original FASTA took 45 seconds. Searching a database of all 8-25 length peptides using the "no cut" option took 20m47s when each peptide was its own FASTA entry. When the 8-25 length peptides were appended to each other, separated by an asterisk (targeting an appended sequence length of ~200K), the search took 1m48s.
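For a quick sense of scale, the three timings above can be turned into ratios. The arithmetic below uses only the numbers reported in this test; the variable names are illustrative, and nothing here implies the same ratios would hold for a larger database.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing into total seconds."""
    return 60 * minutes + seconds


direct_s = to_seconds(0, 45)      # no-enzyme search against the original FASTA
separate_s = to_seconds(20, 47)   # "no cut", one peptide per FASTA entry
packed_s = to_seconds(1, 48)      # "no cut", peptides joined by "*"

print(f"separate vs packed: {separate_s / packed_s:.1f}x")   # 11.5x
print(f"packed vs direct:   {packed_s / direct_s:.1f}x")     # 2.4x
print(f"separate vs direct: {separate_s / direct_s:.1f}x")   # 27.7x
```

In this test the asterisk-appended database searched roughly 11.5 times faster than the one-peptide-per-entry database, while still taking about 2.4 times as long as the direct no-enzyme search.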
