After spending some time testing "comet", I think the "Comet indexed peptide database" works here. With it, comet builds the database once and redundant peptides are combined. If the database file is large, this can save a lot of time. In my test run, CPU*hour consumption dropped by about 90% when using the "Comet indexed peptide database", and memory consumption remained small. The wall-clock running time was about the same, because multithreading does not seem to work with the "Comet indexed peptide database". I think it is a good way to save CPU*hours.
However, I do have some concerns. Whether I run with the protein sequences as the database or with the "Comet indexed peptide database", the results are highly similar but not identical (I added the decoy sequences manually):
1) The "Comet indexed peptide database" does not seem to show all protein ids. The number of protein ids assigned to a peptide is much smaller than when using the protein sequences as the database (with max_duplicate_proteins = -1, to include all proteins).
2) The "Comet indexed peptide database" may assign a different "Label" to peptides. Some peptides exist in both the target and the decoy database; they are labeled "1" when using protein sequences as the database, but may be labeled "-1" when using the "Comet indexed peptide database". The labeling seems to be affected by the order of the decoy and target proteins in the database: when building the "Comet indexed peptide database", a shared peptide is more likely to be labeled "1" if the target sequences come before the decoy sequences.
My plan is: maybe I will assign the "Label" with my own code.
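Something like the following is what I have in mind for the relabeling. It is only a rough sketch under several assumptions: the percolator input is tab-separated with "Label" and "Peptide" columns in the header, peptides are written like "K.PEPTIDER.A" with modifications in square brackets, and there is no extra default-direction row. The function names are my own, and the plain substring test against the target protein sequences may not match how comet itself decides (for example, enzyme rules are ignored).

```python
import re

def strip_peptide(pin_peptide):
    """Reduce e.g. 'K.PEPM[15.99]TIDER.A' to the bare sequence 'PEPMTIDER'."""
    # Remove bracketed modification masses first, so their decimal points
    # are not mistaken for the dots that separate flanking residues.
    core = re.sub(r"\[[^\]]*\]", "", pin_peptide)
    # Drop flanking residues if present (assumed format: X.SEQUENCE.X).
    m = re.match(r"^[^.]*\.(.+)\.[^.]*$", core)
    if m:
        core = m.group(1)
    # Keep residue letters only (drops terminal markers such as 'n' or 'c').
    return re.sub(r"[^A-Z]", "", core)

def relabel(pin_peptide, target_sequences):
    """Return 1 if the peptide occurs in any target protein, otherwise -1."""
    seq = strip_peptide(pin_peptide)
    return 1 if any(seq in prot for prot in target_sequences) else -1

def relabel_pin_lines(lines, target_sequences):
    """Rewrite the Label column of PIN lines; the first line is the header."""
    header = lines[0].rstrip("\n").split("\t")
    label_i = header.index("Label")
    pep_i = header.index("Peptide")
    out = ["\t".join(header)]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        # Proteins may spill over extra trailing fields; they are left untouched.
        fields[label_i] = str(relabel(fields[pep_i], target_sequences))
        out.append("\t".join(fields))
    return out
```

With this rule a peptide found in both target and decoy proteins is always labeled "1", regardless of the order of the sequences in the database. The substring scan over every protein would be slow for a large database, so a set of digested target peptides would probably be needed in practice.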
Or maybe do two runs, one with only the target database and one with only the decoy database? But if so, I don't know the proper way of combining the outputs for percolator. I hope to get some suggestions. Thank you!
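For the two-run idea, the naive combination I can think of is sketched below. It assumes both runs produce tab-separated percolator input files with identical headers containing a "Label" column and no extra default-direction row; the function name is my own. I am not sure this is statistically appropriate for percolator, which is exactly what I would like advice on.

```python
def merge_pin(target_lines, decoy_lines):
    """Concatenate two PIN line lists, forcing Label to 1 / -1 by source."""
    t_header = target_lines[0].rstrip("\n").split("\t")
    d_header = decoy_lines[0].rstrip("\n").split("\t")
    if t_header != d_header:
        raise ValueError("PIN headers differ; the files cannot be merged as-is")
    label_i = t_header.index("Label")
    merged = ["\t".join(t_header)]
    for lines, label in ((target_lines, "1"), (decoy_lines, "-1")):
        for line in lines[1:]:
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split("\t")
            fields[label_i] = label  # label by source file, not by comet's value
            merged.append("\t".join(fields))
    return merged
```

Two things this does not handle: the same spectrum will appear in both files, possibly with the same SpecId, and a peptide present in both databases will show up once as a target and once as a decoy.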