How is comet's xcorr or expect value used in PSM


David Zhao

Aug 21, 2014, 6:08:54 PM
to come...@googlegroups.com
Hi there,

We are in the process of replacing Sequest with multi-threaded comet-ms. Can I apply the discriminant score formula that PeptideProphet uses for Sequest to Comet results, or can I use the expect value directly?
Thanks,

David

Jimmy Eng

Aug 21, 2014, 6:36:49 PM
to come...@googlegroups.com
Comet is natively supported by TPP/PeptideProphet, in case you're not aware. If you look at the discriminant function code in the TPP (CometDiscrimFunction.cxx and SequestDiscrimFunction.cxx), you'll notice that the coefficients are exactly the same between the two. You can also use Comet's expectation value score for modeling in PeptideProphet if you prefer; this is invoked with the EXPECTSCORE option in PeptideProphetParser.

David Zhao

Aug 21, 2014, 7:54:37 PM
to come...@googlegroups.com
Thanks Jimmy. I took a look at the DiscrimFunction code, and it seems that using the expect value in Comet is very simple. Which is the "better way" to calculate discriminant scores: multivariate, as for Sequest, or the expect value alone?

David 

Jimmy Eng

Aug 21, 2014, 8:08:05 PM
to David Zhao, come...@googlegroups.com
David,

Those are questions to ask on the TPP forum (spctools-discuss google group).  There are tradeoffs between the two and I don't know enough to give you insight so it's best if you get feedback from the PeptideProphet folks.  In the TPP's xinteract usage statement, it indicates that using expectation value in place of the discriminant function for X!Tandem could be useful for data with homologous top hits like you would get in phospho searches.  If I could only choose one method to apply generally, I would choose the discriminant function over using the expectation value but it's likely not better in all use cases.

- Jimmy

David Zhao

Aug 22, 2014, 1:41:34 PM
to Jimmy Eng, come...@googlegroups.com

Thanks Jimmy for your insights. I will definitely check with the TPP folks. We have a homegrown discriminant score formula for Sequest result PTM modeling, so calculating a discriminant score for Comet-ms would make more sense. It's just that when I read your paper, it seemed to me that Xcorr scores are calculated for backward compatibility (maybe I read it wrong). Shouldn't the expect value be a more straightforward measure?

Some other questions for comet: 
1. what's the best way to set up for phospho searches?
2. can we specify in the parameter file to not include terminal residue modification of a given amino acid?
3. how can I set up custom amino acid to represent, say, an acetylated aspartate?

David

Jimmy Eng

Aug 22, 2014, 2:12:28 PM
to come...@googlegroups.com, jke...@gmail.com
On Friday, August 22, 2014 10:41:34 AM UTC-7, David Zhao wrote:

Thanks Jimmy for your insights. I will definitely check with the TPP folks. We have a homegrown discriminant score formula for Sequest result PTM modeling, so calculating a discriminant score for Comet-ms would make more sense. It's just that when I read your paper, it seemed to me that Xcorr scores are calculated for backward compatibility (maybe I read it wrong). Shouldn't the expect value be a more straightforward measure?

The xcorr scores are the primary scores calculated for each peptide against every spectrum.  The E-value calculation is based on the distribution of xcorr scores for each spectrum query so the xcorr is definitely not calculated just for backwards compatibility.  fwiw, the preliminary score (Sp) is being calculated for backwards compatibility.
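
To make the dependence of the E-value on the xcorr distribution concrete, here is a minimal Python sketch of the general idea: for one spectrum, fit a line to log10 of the number of candidates scoring at or above each xcorr, then extrapolate that line to the score of interest. This is a simplification for illustration only and is not Comet's actual code (Comet works from a score histogram and fits only part of the cumulative distribution). The names `fit_line` and `estimate_evalue` are made up for this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_evalue(candidate_xcorrs, query_xcorr):
    """Extrapolate log10(count of candidates >= score) to query_xcorr."""
    scores = sorted(candidate_xcorrs, reverse=True)
    # The candidate at 1-based rank r has r candidates scoring >= its xcorr.
    log_counts = [math.log10(rank) for rank in range(1, len(scores) + 1)]
    slope, intercept = fit_line(scores, log_counts)
    return 10 ** (slope * query_xcorr + intercept)


# Synthetic scores constructed so that log10(count) = 3 - xcorr exactly.
synthetic = [3.0 - math.log10(rank) for rank in range(1, 1001)]
print(estimate_evalue(synthetic, 5.0))  # close to 0.01
```

Since the fit is built from every xcorr computed for the spectrum, the E-value in this picture cannot exist without the xcorr scores.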

Some other questions for comet: 
1. what's the best way to set up for phospho searches?

There's no one right answer. You need to choose your mass tolerances, enzyme settings, etc. The phospho variable mods would be specified as something like:

variable_mod1 = 79.966331 STY 0 3
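
For anyone scripting around the parameter file, here is a small Python sketch that reads the line above and checks the mass against HPO3. The field interpretation (mass delta, residues, a 0/1 binary-modification flag, maximum modified residues per peptide) is my reading of the four-field format used at the time and should be checked against the documentation for your Comet version. `VariableMod` and `parse_variable_mod` are hypothetical names.

```python
from typing import NamedTuple


class VariableMod(NamedTuple):
    mass_delta: float      # mass added to each modified residue
    residues: str          # residues the modification may occur on
    binary: bool           # assumed meaning of the third field (0/1 flag)
    max_per_peptide: int   # assumed meaning of the fourth field


def parse_variable_mod(line: str) -> VariableMod:
    """Parse a 'variable_modN = mass residues flag max' parameter line."""
    _key, _, value = line.partition("=")
    value = value.split("#", 1)[0]  # drop any trailing comment
    mass, residues, binary, max_mods = value.split()
    return VariableMod(float(mass), residues, binary == "1", int(max_mods))


# Monoisotopic element masses (H, P, O) used to check the phospho delta.
HPO3_MONO = 1.00782503 + 30.97376163 + 3 * 15.99491462

mod = parse_variable_mod("variable_mod1 = 79.966331 STY 0 3")
print(mod, round(HPO3_MONO, 6))
```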
 
2. can we specify in the parameter file to not include terminal residue modification of a given amino acid?

No such functionality exists in Comet.
 
3. how can I set up custom amino acid to represent, say, an acetylated aspartate?

If you want to have custom amino acids in your database, you can do so with the letters B, J, U, X, Z.  Then just edit the mass of each letter in the parameters file to whatever mass you want, e.g.

   add_B_user_amino_acid = 175.13936

Hopefully this is what you're asking for; if not, please clarify.
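
As a worked example of picking the mass for a custom letter, the Python sketch below computes a monoisotopic residue mass for "aspartate plus an acetyl group" and prints a parameter line. Two assumptions are baked in and should be verified: that the modification is a simple +C2H2O on the Asp residue, and that the letter B carries no built-in mass, so the value entered is the full residue mass. The helper `formula_mass` is made up for this example.

```python
# Monoisotopic element masses.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.99491462}


def formula_mass(formula):
    """Sum monoisotopic masses for a {element: count} composition."""
    return sum(MONO[element] * count for element, count in formula.items())


ASP_RESIDUE = {"C": 4, "H": 5, "N": 1, "O": 3}  # aspartate residue, C4H5NO3
ACETYL = {"C": 2, "H": 2, "O": 1}               # net addition on acetylation

# Assumption: B has no default mass, so this is the whole residue mass.
custom_mass = formula_mass(ASP_RESIDUE) + formula_mass(ACETYL)
print(f"add_B_user_amino_acid = {custom_mass:.5f}")
```

The protein database would then use the letter B wherever the modified residue occurs.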
 

David

David Zhao

Sep 11, 2014, 1:51:47 PM
to come...@googlegroups.com
Hi Jimmy,

I've done some comparisons of Sequest and Comet results lately, and I have some observations I'd like to share to get your take on them:
1. I ported the discriminant score formula to Perl and Java and used it to calculate discriminant scores for PTM. Below is the plot of our XcorrNorm score (normalized Xcorr) vs. discriminant score by charge. It seems that the discriminant score penalizes higher charge state hits? I can see that from the code as well.



but the expect value correlates better with our XcorrNorm score:

2. If I use the recommended parameter settings from the Comet web site for our Velos samples (mainly 20 ppm, with mono/mono parent and fragment ion masses), I get far fewer hits. As you know, peptides (proteins) in our samples are pulled down by our activity-based probes, and given the way we run the samples on our instrument, I found that using average parent mass and monoisotopic fragment mass with a 2 Da tolerance gives me better results. Do you think this makes sense? And should I set "isotope_error" to 1 in this case? I currently set it to 0.
Does using C ions in the search make any difference? We use them in Sequest, but it seems they're not recommended in Comet.

3. And the million dollar question: if I need to get as many hits as possible, what will be the best settings? Or which settings will make the biggest impact?
Thanks a lot!

David





On Wed, Aug 27, 2014 at 11:39 AM, David Zhao <weizh...@gmail.com> wrote:
Thanks, Jimmy!

David

On Aug 27, 2014, at 11:01 AM, Jimmy Eng <jke...@gmail.com> wrote:

If you look at a Comet generated pep.xml file, there is a "massdiff" attribute for the "search_hit" element.  The value of the "massdiff" attribute is (precursor_neutral_mass - calc_neutral_pep_mass) so experimental mass minus calculated peptide mass.
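
The definition above can be checked with a few lines of Python. This is a sketch on a made-up, stripped-down fragment: the numeric values are invented, and real pep.xml files declare an XML namespace, so tag names would need to be namespace-qualified when parsing them. `mass_differences` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented values; only the element and attribute names follow the text above.
PEPXML_FRAGMENT = """
<spectrum_query precursor_neutral_mass="1500.7300" assumed_charge="2">
  <search_result>
    <search_hit hit_rank="1" calc_neutral_pep_mass="1500.7000" massdiff="0.0300"/>
  </search_result>
</spectrum_query>
"""


def mass_differences(xml_text):
    """Return (massdiff in Da, massdiff in ppm) for every search_hit."""
    root = ET.fromstring(xml_text)
    results = []
    for query in root.iter("spectrum_query"):
        observed = float(query.get("precursor_neutral_mass"))
        for hit in query.iter("search_hit"):
            calculated = float(hit.get("calc_neutral_pep_mass"))
            diff = observed - calculated  # experimental minus calculated
            results.append((diff, diff / calculated * 1e6))
    return results


for diff_da, diff_ppm in mass_differences(PEPXML_FRAGMENT):
    print(f"{diff_da:.4f} Da, {diff_ppm:.1f} ppm")
```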


On Wed, Aug 27, 2014 at 9:54 AM, David Zhao <weizh...@gmail.com> wrote:
Is it the difference between the theoretical and experimental mass?
Thanks

David


On Tue, Aug 26, 2014 at 2:29 PM, David Zhao <weizh...@gmail.com> wrote:
Hi Jimmy,

I'm looking at CometDiscrimFunction.cxx to port the function to Java and Perl, and I have one question: what is the massdiff field in CometSearchResult? Where is it in the Comet results? BTW, is there documentation on the Comet results somewhere?
Thanks,

David

Jimmy Eng

Sep 11, 2014, 6:05:18 PM
to come...@googlegroups.com
see my replies inline below.


On Thursday, September 11, 2014 10:51:47 AM UTC-7, David Zhao wrote:
Hi Jimmy,

I've done some comparisons of Sequest and Comet results lately, and I have some observations I'd like to share to get your take on them:
1. I ported the discriminant score formula to Perl and Java and used it to calculate discriminant scores for PTM. Below is the plot of our XcorrNorm score (normalized Xcorr) vs. discriminant score by charge. It seems that the discriminant score penalizes higher charge state hits? I can see that from the code as well.



but the expect value correlates better with our XcorrNorm score:


I don't know if I can add any useful comment here; I suggest you use what works for you.  The discriminant scores in PeptideProphet don't make use of XcorrNorm and that tool analyzes each charge state separately so it's not obvious to me if it's a good or bad outcome demonstrated in your discriminant analysis vs. XcorrNorm plot above.  I also don't know the behavior of XcorrNorm myself so I don't know how to interpret something correlating well with it.
 
2. If I use the recommended parameter settings from the Comet web site for our Velos samples (mainly 20 ppm, with mono/mono parent and fragment ion masses), I get far fewer hits. As you know, peptides (proteins) in our samples are pulled down by our activity-based probes, and given the way we run the samples on our instrument, I found that using average parent mass and monoisotopic fragment mass with a 2 Da tolerance gives me better results. Do you think this makes sense? And should I set "isotope_error" to 1 in this case? I currently set it to 0.

What recommended parameter setting are you using with your Velos sample?  Just a pure Velos instrument should use the comet.params.low-low because there is no high-res data in either the MS or MS/MS scans.  If it's really an Orbi-Velos instrument then I would be surprised that a 2 Da, avg mass setting works better than the 20 ppm mono mass setting with "isotope_error = 1".

With a 2 Da tolerance, set "isotope_error = 0".
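
To put numbers on the two settings, the Python sketch below converts a tolerance into an accepted precursor mass window. The 1500 Da example mass is arbitrary and `tolerance_window` is a made-up helper. The relevant comparison is with the roughly 1.00335 Da spacing between isotope peaks: a 20 ppm window is far narrower than that spacing, while a 2 Da window already spans it, which is presumably why isotope_error is only useful with the narrow setting.

```python
def tolerance_window(mass, tolerance, units):
    """Return (low, high) accepted precursor masses for a tolerance setting."""
    if units == "ppm":
        half_width = mass * tolerance / 1e6
    elif units == "Da":
        half_width = tolerance
    else:
        raise ValueError(f"unknown units: {units}")
    return mass - half_width, mass + half_width


ISOTOPE_SPACING = 1.00335  # approximate C13 - C12 mass difference in Da

for tolerance, units in [(20, "ppm"), (2.0, "Da")]:
    low, high = tolerance_window(1500.0, tolerance, units)
    covers_isotope = (high - 1500.0) >= ISOTOPE_SPACING
    print(f"{tolerance} {units}: {low:.4f} to {high:.4f}, "
          f"covers +1 isotope: {covers_isotope}")
```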
 
Does using C ions in the search make any difference? We use them in Sequest, but it seems they're not recommended in Comet.

I don't use it but that doesn't mean that it couldn't make a small positive difference.  Search a couple of datasets using a target-decoy search with and without specifying C-ions and see what gives you more IDs at a given FDR.  It would be helpful if you report your findings back here.
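
The comparison suggested above can be scripted along these lines. This is a minimal Python sketch of a plain target-decoy count, assuming a concatenated target-decoy search where each PSM is reduced to an (E-value, is_decoy) pair and FDR is estimated as decoys / targets; `ids_at_fdr` is a hypothetical name and the demo data is synthetic.

```python
def ids_at_fdr(psms, max_fdr=0.01):
    """Count target PSMs accepted at the loosest E-value cutoff meeting max_fdr.

    psms is an iterable of (evalue, is_decoy) pairs. The FDR at each cutoff
    is estimated as (decoys so far) / (targets so far).
    """
    targets = decoys = accepted = 0
    for _evalue, is_decoy in sorted(psms, key=lambda psm: psm[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


# Synthetic demo: 100 confident targets, then decoys mixed in with targets.
demo = [(1e-6 * (i + 1), False) for i in range(100)]
demo.append((1e-3, True))
demo += [(2e-3 + i * 1e-4, False) for i in range(5)]
demo.append((5e-3, True))
demo += [(1e-2 + i * 1e-3, False) for i in range(10)]
print(ids_at_fdr(demo, max_fdr=0.01))
```

Running the same function on the PSM lists from the search with and without C-ions, at the same max_fdr, gives the two counts to compare.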

3. And the million dollar question, if I need to get as many hits as possible, what will be the best settings? or which settings will make the biggest impact?
Thanks a lot!

That is a million dollar question that I can't answer and I'm sure there's no one right answer for all use cases.  I already have suggestions for best parameter settings for various combinations of low res and high res spectra.  But how you process your data post-search has an impact as well.

If I had the time, I'd vary:
- precursor tolerances
- fragment tolerances
- full vs. semi-digest
- likely always use mono masses
- compare performance of all combinations above using plain target-decoy FDR analysis based on the E-value
- compare performance of all combinations above using tools like Percolator and PeptideProphet
- repeat for multiple datasets acquired in various ways
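
A sweep like the one listed above could be enumerated with a short Python sketch such as the following. The parameter names are Comet parameter names as I understand them, and the values are placeholders rather than recommendations; both should be checked against your own comet.params. `parameter_combinations` is a made-up helper, and running the searches and scoring each result is left out.

```python
from itertools import product

# Placeholder values for illustration; not recommended settings.
GRID = {
    "peptide_mass_tolerance": [2.0, 3.0],   # precursor tolerance
    "fragment_bin_tol": [1.0005, 0.4],      # fragment tolerance (bin width)
    "num_enzyme_termini": [1, 2],           # assumed: 1 = semi, 2 = full digest
}


def parameter_combinations(grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


for number, overrides in enumerate(parameter_combinations(GRID), start=1):
    print(number, overrides)
```

Each combination would then be searched and evaluated twice, once with plain target-decoy FDR on the E-value and once with a tool such as Percolator or PeptideProphet, and the whole sweep repeated per dataset.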

I haven't done this analysis though which is why I can't really answer your question (not that there would be one right answer for your question).  Until I get more insight, all I can do is suggest you use the suggested parameters (comet.params.high-high, high-low, and low-low) for your data.  Or at least use those as starting points for yourself to determine what works best for you.

- Jimmy 

David Zhao

Sep 11, 2014, 7:11:59 PM
to come...@googlegroups.com
Thanks Jimmy, and please see my comments below, and thank you for your time!


On Thursday, September 11, 2014 3:05:18 PM UTC-7, Jimmy Eng wrote:
see my replies inline below.

On Thursday, September 11, 2014 10:51:47 AM UTC-7, David Zhao wrote:
Hi Jimmy,

I've done some comparisons of Sequest and Comet results lately, and I have some observations I'd like to share to get your take on them:
1. I ported the discriminant score formula to Perl and Java and used it to calculate discriminant scores for PTM. Below is the plot of our XcorrNorm score (normalized Xcorr) vs. discriminant score by charge. It seems that the discriminant score penalizes higher charge state hits? I can see that from the code as well.



but the expect value correlates better with our XcorrNorm score:


I don't know if I can add any useful comment here; I suggest you use what works for you.  The discriminant scores in PeptideProphet don't make use of XcorrNorm and that tool analyzes each charge state separately so it's not obvious to me if it's a good or bad outcome demonstrated in your discriminant analysis vs. XcorrNorm plot above.  I also don't know the behavior of XcorrNorm myself so I don't know how to interpret something correlating well with it.
My finding here is that either our own XcorrNorm or the log-transformed expect value works for me, with not much difference in FDR.

2. If I use the recommended parameter settings from the Comet web site for our Velos samples (mainly 20 ppm, with mono/mono parent and fragment ion masses), I get far fewer hits. As you know, peptides (proteins) in our samples are pulled down by our activity-based probes, and given the way we run the samples on our instrument, I found that using average parent mass and monoisotopic fragment mass with a 2 Da tolerance gives me better results. Do you think this makes sense? And should I set "isotope_error" to 1 in this case? I currently set it to 0.

What recommended parameter setting are you using with your Velos sample?  Just a pure Velos instrument should use the comet.params.low-low because there is no high-res data in either the MS or MS/MS scans.  If it's really an Orbi-Velos instrument then I would be surprised that a 2 Da, avg mass setting works better than the 20 ppm mono mass setting with "isotope_error = 1".
With a 2 Da tolerance, set "isotope_error = 0".
You're right. I just realized that we have an LTQ Velos and an LTQ Velos PRO here, so I should use the low-low params; that's why a 2 Da mass tolerance with isotope_error = 0 worked better. My settings differ from low-low in two ways:
- peptide_mass_tolerance is set to 2.0 instead of 3.0
- mass_type_parent is set to 0 instead of 1
I will give the low-low settings a try and report back.
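
A comparison like this against a reference parameter file can be automated. Below is a minimal Python sketch, assuming the usual "name = value" lines with "#" comments; the two inline texts contain only the two parameters mentioned in this post, and `parse_params` and `diff_params` are hypothetical helpers.

```python
def parse_params(text):
    """Parse 'name = value' lines, ignoring '#' comments and other lines."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def diff_params(mine, reference):
    """Return {name: (my value, reference value)} where the two differ."""
    names = sorted(set(mine) | set(reference))
    return {name: (mine.get(name), reference.get(name))
            for name in names if mine.get(name) != reference.get(name)}


MY_SETTINGS = """
peptide_mass_tolerance = 2.0
mass_type_parent = 0        # average masses
"""
LOW_LOW = """
peptide_mass_tolerance = 3.0
mass_type_parent = 1        # monoisotopic masses
"""

for name, (mine, ref) in diff_params(parse_params(MY_SETTINGS),
                                     parse_params(LOW_LOW)).items():
    print(f"{name}: mine={mine} low-low={ref}")
```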


 
Does using C ions in the search make any difference? We use them in Sequest, but it seems they're not recommended in Comet.

I don't use it but that doesn't mean that it couldn't make a small positive difference.  Search a couple of datasets using a target-decoy search with and without specifying C-ions and see what gives you more IDs at a given FDR.  It would be helpful if you report your findings back here.

I didn't see much difference with or without searching c-ions; either way is fine, I think.



3. And the million dollar question, if I need to get as many hits as possible, what will be the best settings? or which settings will make the biggest impact?
Thanks a lot!

That is a million dollar question that I can't answer and I'm sure there's no one right answer for all use cases.  I already have suggestions for best parameter settings for various combinations of low res and high res spectra.  But how you process your data post-search has an impact as well.

If I had the time, I'd vary:
- precursor tolerances
I will try 3 Da.
- fragment tolerances
Should I increase or decrease the fragment tolerance? Our Sequest setting is actually 0.0000.
- full vs. semi-digest
Allowed missed cleavages is currently set to 3 instead of the suggested 2 (the same as our Sequest settings).
- likely always use mono masses
I will try this. I have only tried 2 Da mono/mono with isotope_error = 1, which I guess is not a sound combination.
- compare performance of all combinations above using plain target-decoy FDR analysis based on the E-value
Using the E-value vs. our XcorrNorm yields similar results.