Early exit crux tide-search

43 views
Skip to first unread message

Philip Remes

unread,
Nov 25, 2019, 1:16:49 AM11/25/19
to crux-users
Hello,

I'm new to Crux, and have been trying to analyze some raw files that were previously searched with Sequest.  I'm having some trouble though, in that the program just stops in the middle of processing.  In this case it was at 16%.  These data are 1 Da isolation width DIA data with nominal mass accuracy MS2 analysis.  I've set the Charge State in the spectrum header to 0, so that in Sequest in Proteome Discoverer the spectra would be searched as 2 and 3+.  This may be relevant, because I noticed a warning message about the Spectrum 2 not having a charge state.  I've attached the log and the parameters.  Any ideas?  I could send you a link to a raw file as well if you send me an email address, if this will help to debug the issue.

WARNING: Spectrum 2 has no charge state. Calculating charge

Thanks
Philip
pipeline.log.txt
pipeline.params.txt

William S Noble

unread,
Nov 25, 2019, 10:24:09 AM11/25/19
to Philip Remes, crux-users
Hi Philip,

I suspect that you have run into the threading bug that we are currently trying to track down. Can you try re-running your command with "--num-threads 1"?

Bill


--
You received this message because you are subscribed to the Google Groups "crux-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to crux-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crux-users/01aef6ba-796b-4c90-b00a-60b511edaa9b%40googlegroups.com.

Philip Remes

unread,
Nov 25, 2019, 1:59:08 PM11/25/19
to crux-users
Hi Bill,

Thanks for the fast reply.  Yes indeed setting to one thread allowed the search to complete.  Amazingly fast!  About 3x faster than Sequest-HT with Proteome Discoverer.  Now I'm trying to compare to results I'd gotten with these files using Sequest+Percolator in PD and looking for why the number of unique peptides is smaller with Tide+Percolator.  I suspect there is probably some parameter that I don't have set the same between the two analyses and this is where I'll investigate next, but there were a couple of messages that drew my attention in the log.  A message says that a large number of peaks were deleted, although they are apparently a small fraction of the total.  Is there a way to know in which ways the peaks were out of bounds, or is this a typical number of deleted peaks?  Additionally the number of candidates per spectrum-charge combination is negative, that also made me question if something was not quite correct.

INFO: [Thread 0]: Deleted 0 precursor, 0 isotope and 7817920 out-of-range peaks.
INFO: [Thread 0]: Retained 97.1706% of peaks.
INFO: Time per spectrum-charge combination: 0.004389 s.
INFO: Average number of candidates per spectrum-charge combination: -1932.151245 

Thanks
Philip


On Monday, November 25, 2019 at 7:24:09 AM UTC-8, Bill Noble wrote:
Hi Philip,

I suspect that you have run into the threading bug that we are currently trying to track down. Can you try re-running your command with "--num-threads 1"?

Bill


On Sun, Nov 24, 2019 at 10:17 PM Philip Remes <phili...@gmail.com> wrote:
Hello,

I'm new to Crux, and have been trying to analyze some raw files that were previously searched with Sequest.  I'm having some trouble though, in that the program just stops in the middle of processing.  In this case it was at 16%.  These data are 1 Da isolation width DIA data with nominal mass accuracy MS2 analysis.  I've set the Charge State in the spectrum header to 0, so that in Sequest in Proteome Discoverer the spectra would be searched as 2 and 3+.  This may be relevant, because I noticed a warning message about the Spectrum 2 not having a charge state.  I've attached the log and the parameters.  Any ideas?  I could send you a link to a raw file as well if you send me an email address, if this will help to debug the issue.

WARNING: Spectrum 2 has no charge state. Calculating charge

Thanks
Philip

--
You received this message because you are subscribed to the Google Groups "crux-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to crux-...@googlegroups.com.
pipeline.params.txt
pipeline.log.txt

William S Noble

unread,
Nov 25, 2019, 2:16:20 PM11/25/19
to Philip Remes, crux-users
I'm glad to hear that it is running now. I would not worry about the out-of-range peaks. That is not an unusual number. The negative value for the number of candidates per scan is definitely weird.  But the problem is almost certainly in the bookkeeping for producing that output, when we aggregate the counts over each spectrum in the data set.  I think the difference in performance you are seeing likely relates to some other discrepancy in how the index is constructed or the search is run.  You may get better (albeit markedly slower) results if you turn on the exact-p-value option or if you switch the score-function option to "both."

Bill


To unsubscribe from this group and stop receiving emails from it, send an email to crux-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crux-users/68653a85-fd8a-4299-8cb7-9c700bd31282%40googlegroups.com.

Philip Remes

unread,
Dec 20, 2019, 1:36:07 AM12/20/19
to crux-users

I’ve continued to look into the differences between analyses of a data set with Crux\Tide+Percolator compared to ProteomeDiscoverer\SequestHT+Percolator, from an experiment with 1 Da isolation width DIA data.  I focused first on some of the Tide parameters like using exact-p-value as you suggested, mz fragment bin size, precursor tolerance, and making sure the modifications for the tide-index are correct (just the ol’ dynamic M+16 and static C+57).  Pretty sure that these things are okay, I started looking into the differences in XCorrs and Percolator results.  One observation is that for these data the Crux Percolator used 12 SVM parameters, and pretty much discriminated true and false positives just with XCorr: or that’s the conclusion I drew from the SVM weights and by plotting Q-Value versus XCorr.  The PD Percolator used 33 SVM parameters, and put more emphasis on other things than XCorr which aren’t in the Crux\Percolator, like Fragment coverage and Longest Y Sequence Series.  Another observation is that the XCorr values for the same peptide in the same scan are much different.  The values correlate pretty well, but there’s a lot of negative Tide XCorrs, suggesting that something has is not the same between the XCorr implementations of Tide and Sequest-HT, although it could be just a translation that has no real effect.  I hadn’t known that XCorr could be negative, but see that this could certainly happen with large fragment bins, if I’m interpreting Jimmy’s 2008 XCorr paper correctly.  Anyways the result of all this is that if you plot the q-values or PEP for peptides in the same spectra, there’s little correspondence between the outputs from the two pipelines.

 

What I’m wondering is if maybe there are aspects of the PD pipeline that are more favorable to nominal mass accuracy/resolution MS2 spectra when the isolation window is not centered on the precursor peptide, and if there's a way to make Crux perform similarly.  I’ve observed similar things for other search engines, for example Byonic has trouble identifying peptides in these files, but performs comparably to Sequest for DDA spectra, when the precursor is centered on the peak.  MSFragger crashes when these files are used, so there appears to be something a little unexpected about them.  I’ve not yet been able to finish a Crux\Comet + Percolator search, because I tend to get an error about a bad allocation, and also my computer uses so many resources during the analysis that I can’t do anything else with it, even with Comet num_threads = 1.  Do you have any thoughts about these observations, or ideas about ways to optimize the Tide+Percolator results for this kind of data?  I didn’t see on the Crux\Percolator page any parameters about the SVM features to be used.  Perhaps there is a way to use the same ones as PD\Percolator?

 

Thanks

Philip


On Monday, November 25, 2019 at 11:16:20 AM UTC-8, Bill Noble wrote:
I'm glad to hear that it is running now. I would not worry about the out-of-range peaks. That is not an unusual number. The negative value for the number of candidates per scan is definitely weird.  But the problem is almost certainly in the bookkeeping for producing that output, when we aggregate the counts over each spectrum in the data set.  I think the difference in performance you are seeing likely relates to some other discrepancy in how the index is constructed or the search is run.  You may get better (albeit markedly slower) results if you turn on the exact-p-value option or if you switch the score-function option to "both."

Bill


191204_tide_and_sequest.pptx

William S Noble

unread,
Dec 20, 2019, 12:00:29 PM12/20/19
to Philip Remes, crux-users, Lukas Käll
Hi Philip,

Thanks for your comments.  My responses are interleaved below.

On Thu, Dec 19, 2019 at 10:36 PM Philip Remes <philip...@gmail.com> wrote:

I’ve continued to look into the differences between analyses of a data set with Crux\Tide+Percolator compared to ProteomeDiscoverer\SequestHT+Percolator, from an experiment with 1 Da isolation width DIA data.  I focused first on some of the Tide parameters like using exact-p-value as you suggested, mz fragment bin size, precursor tolerance, and making sure the modifications for the tide-index are correct (just the ol’ dynamic M+16 and static C+57).  Pretty sure that these things are okay, I started looking into the differences in XCorrs and Percolator results.  One observation is that for these data the Crux Percolator used 12 SVM parameters, and pretty much discriminated true and false positives just with XCorr: or that’s the conclusion I drew from the SVM weights and by plotting Q-Value versus XCorr.  The PD Percolator used 33 SVM parameters, and put more emphasis on other things than XCorr which aren’t in the Crux\Percolator, like Fragment coverage and Longest Y Sequence Series. 


This is interesting.  I am cc'ing Lukas Kall so he can comment. It certainly seems like it might be worthwhile for us to compute some of these additional features if they are highly discriminative.
 

Another observation is that the XCorr values for the same peptide in the same scan are much different.  The values correlate pretty well, but there’s a lot of negative Tide XCorrs, suggesting that something has is not the same between the XCorr implementations of Tide and Sequest-HT, although it could be just a translation that has no real effect.  I hadn’t known that XCorr could be negative, but see that this could certainly happen with large fragment bins, if I’m interpreting Jimmy’s 2008 XCorr paper correctly.  Anyways the result of all this is that if you plot the q-values or PEP for peptides in the same spectra, there’s little correspondence between the outputs from the two pipelines.


In Jimmy's original SEQUEST implementation, the XCorr calculation had a simple truncation at zero, so if the XCorr was going to be negative it just got set to zero.  As you point out, there is no conceptual reason why XCorr can't go negative, though in practice good matches are essentially always positive.  That said, there are lots of other reasons why XCorrs might differ, having to do with choices made in various steps of the spectrum pre-processing.  You can see these steps using the print-process-spectra command in Crux:


But I doubt there is something similar for Sequest-HT.
 

 

What I’m wondering is if maybe there are aspects of the PD pipeline that are more favorable to nominal mass accuracy/resolution MS2 spectra when the isolation window is not centered on the precursor peptide, and if there's a way to make Crux perform similarly. 


Can you clarify what you mean by "nominal" here?
 

I’ve observed similar things for other search engines, for example Byonic has trouble identifying peptides in these files, but performs comparably to Sequest for DDA spectra, when the precursor is centered on the peak.  MSFragger crashes when these files are used, so there appears to be something a little unexpected about them.  I’ve not yet been able to finish a Crux\Comet + Percolator search, because I tend to get an error about a bad allocation, and also my computer uses so many resources during the analysis that I can’t do anything else with it, even with Comet num_threads = 1.  Do you have any thoughts about these observations, or ideas about ways to optimize the Tide+Percolator results for this kind of data? 


Are you using a very large peptide database?  And if so, are you running tide-search on an index created by tide-index?  This part of the pipeline has a large memory footprint (which we are working on reducing). If you can create your index once and store it on disk, then you should get much better performance for the downstream steps.
 

I didn’t see on the Crux\Percolator page any parameters about the SVM features to be used.  Perhaps there is a way to use the same ones as PD\Percolator?

 


Are there specific parameters that you have in mind? I was under the impression that we support all of the parameters in the stand-alone version of Percolator at http://percolator.ms, but maybe we missed some.
 
Bill


To unsubscribe from this group and stop receiving emails from it, send an email to crux-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crux-users/94a0f92c-2b7f-4a7c-a2aa-45e22a444f77%40googlegroups.com.

Philip Remes

unread,
Dec 20, 2019, 5:55:25 PM12/20/19
to crux-users
Hi Bill

Thanks for you reply.  

Q: Can you clarify what you mean by "nominal" here?
A: ~0.7 m/z FWHM, +/- ~0.2 m/z mass accuracy.  The data are not centered on any precursor, but are DIA data with 1 Da isolation width.

Q: Are you using a very large peptide database?  And if so, are you running tide-search on an index created by tide-index?
A: Yes the entire human fasta is used.  For Tide I used your indexing pre-processing step, and then it all runs pretty fast.  Jimmy gave me a hint about setting the spectrum batch size to 10000 which seemed to help for Comet as far as the program not grabbing all the computer resources, and now I have some results from Comet+Percolator now too in the attached slides.

Q: Are there specific parameters that you have in mind? I was under the impression that we support all of the parameters in the stand-alone version of Percolator at http://percolator.ms, but maybe we missed some.
A: I notice now when I ran the Comet+Percolator pipeline that there are 16 Percolator parameters instead of the 12 used for the Tide+Percolator pipeline, including deltaCn which appears to be very discriminatory.  Comet+Percolator produces almost as many unique peptide identifications as PD, although there are still half as many PSMs.  As nearly as I can tell, the pipeline parameters I used for Comet and Tide are as similar as could be, so it appears that when Comet is the search engine, there are a 4 more features that are used.  On slide 4 of the power point I uploaded, you can see the additional features that PD used (33 total), some of which seemed to be significant.  I couldn't tell from the percolator.ms website or on yours how to specify the features that get used though.  Is that supported?  It could be interesting to try out some of these additional features with the data produced from Tide and Comet.

There's some additional kind of interesting observation that the Comet and Tide Xcorrs aren't exactly the same, but this may not be of real significance.

Best,
Philip

Jimmy gave me a little pointer about the spectrum batch size, and so I have Comet+Percolator results for the same data set.  The number of unique peptides is quite a bit higher than the Tide result.


On Friday, December 20, 2019 at 9:00:29 AM UTC-8, Bill Noble wrote:
Hi Philip,

Thanks for your comments.  My responses are interleaved below.
191220_tide_sequest_comet.pptx
comet.params.txt
tide-params.txt

William S Noble

unread,
Dec 21, 2019, 11:00:03 AM12/21/19
to Philip Remes, crux-users
On Fri, Dec 20, 2019 at 2:56 PM Philip Remes <philip...@gmail.com> wrote:
Hi Bill

Thanks for you reply.  

Q: Can you clarify what you mean by "nominal" here?
A: ~0.7 m/z FWHM, +/- ~0.2 m/z mass accuracy.  The data are not centered on any precursor, but are DIA data with 1 Da isolation width.


OK, thanks for the clarification.
 
Q: Are you using a very large peptide database?  And if so, are you running tide-search on an index created by tide-index?
A: Yes the entire human fasta is used.  For Tide I used your indexing pre-processing step, and then it all runs pretty fast.  Jimmy gave me a hint about setting the spectrum batch size to 10000 which seemed to help for Comet as far as the program not grabbing all the computer resources, and now I have some results from Comet+Percolator now too in the attached slides.


Ah, I would not call that "very large" -- that's a fairly standard database.  But I'm glad it's working for you.
 
Q: Are there specific parameters that you have in mind? I was under the impression that we support all of the parameters in the stand-alone version of Percolator at http://percolator.ms, but maybe we missed some.
A: I notice now when I ran the Comet+Percolator pipeline that there are 16 Percolator parameters instead of the 12 used for the Tide+Percolator pipeline, including deltaCn which appears to be very discriminatory.  Comet+Percolator produces almost as many unique peptide identifications as PD, although there are still half as many PSMs.  As nearly as I can tell, the pipeline parameters I used for Comet and Tide are as similar as could be, so it appears that when Comet is the search engine, there are a 4 more features that are used.  On slide 4 of the power point I uploaded, you can see the additional features that PD used (33 total), some of which seemed to be significant.  I couldn't tell from the percolator.ms website or on yours how to specify the features that get used though.  Is that supported?  It could be interesting to try out some of these additional features with the data produced from Tide and Comet.


OK, this is a terminology mismatch.  When you said "parameter" I thought you meant command line options.  What you are calling "parameters" I usually refer to as "features" -- the input features to Percolator.

We removed deltaCn from the Tide pipeline because, in principle, deltaCn violates the assumptions of the FDR control procedure (by taking into account properties of multiple PSMs).  Especially in conjunction with a more powerful machine learning classifier, this type of feature can lead to bias.  But for a linear SVM like in Percolator, it probably is not so much of a problem.  And you are right that that feature is typically very discriminative.
 
There's some additional kind of interesting observation that the Comet and Tide Xcorrs aren't exactly the same, but this may not be of real significance.


FWIW, you can see how to make Tide and Comet's XCorr agree from the first FAQ here:


image.png
Bill

 
To unsubscribe from this group and stop receiving emails from it, send an email to crux-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crux-users/3ae57e0c-d85f-46ec-84e7-934837d91d14%40googlegroups.com.

Philip Remes

unread,
Dec 21, 2019, 11:55:20 AM12/21/19
to William S Noble, crux-users
I’m new to the field so don’t have the right terminology.  So did you ever consider the other features that the PD Percolator uses like Fragment Coverage %, Log Matched Fragment Intensities, DeltaMass Fragments, and Longest Sequence Series?  At least for my data set these were some of the ones that had a lot of weight.  They would also seem to be derived from a single PSM.

Best
Philip

William S Noble

unread,
Dec 23, 2019, 12:08:45 PM12/23/19
to Philip Remes, crux-users
No, we have not considered these other features.  Not for lack of interest -- it's just something we have not had the time to do.  Since many of these require access to the spectrum as well as the peptide, these features would need to be calculated by tide-search rather than make-pin.  It would be relatively straightforward to add these features to the output files produced by tide-search and then modify make-pin to parse those features and include them in the Percolator input file. If this is something you are interested in trying, we could give pointers for where in the code changes should be made.  But right now we are focused on some bug fixes rather than adding new functionality.

Bill

On Sat, Dec 21, 2019 at 2:23 PM William Stafford Noble <wno...@uw.edu> wrote:
No, we have not considered these other features.  Not for lack of interest -- it's just something we have not had the time to do.  Since many of these require access to the spectrum as well as the peptide, these features would need to be calculated by tide-search rather than make-pin.  It would be relatively straightforward to add these features to the output files produced by tide-search and then modify make-pin to parse those features and include them in the Percolator input file. If this is something you are interested in trying, we could give pointers for where in the code changes should be made.  But right now we are focused on some bug fixes rather than adding new functionality.

Bill

Reply all
Reply to author
Forward
0 new messages