spectral-count issues regarding decoy database and protein quantification


esv...@gmail.com

Sep 30, 2020, 3:07:05 PM
to crux-users

Hey Bill,

I've been experimenting with your spectral count features, the NSAF function specifically, and I've noticed a few issues that make it problematic to use.

First, quantification fails if any decoy PSMs pass your chosen FDR threshold and are listed in the .percolator.target.psms file. Every returned quantity is zero because decoy_sp|XXXX is not in the FASTA database. This might not be the case with a tide search; I'm not sure whether the database generated there would be more usable. I, however, ran this after a comet search, so my original database file doesn't contain the decoys. They don't need to be quantified anyway, so perhaps ignoring them when running spectral counts would be the ideal implementation?

Second, when I opt to quantify proteins, the results yield a rather different list of proteins than the one validated via fido. Looking at the equation you use to calculate this, it seems pretty simple to do, so I'm planning to write my own script that runs through my fido-validated proteins rather than using spectral-counts. Is there any reason you quantify proteins based on the percolator.target.psms file? I would think that quantifying peptides from the PSM file based on the target.protein matches would be the ideal end result.
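
For readers following along, here is a minimal sketch of the calculation being discussed, assuming the standard NSAF definition (each protein's spectral count divided by its length, normalized by the sum of those ratios over all proteins). The function name, the protein IDs, and the counts are made up for illustration, and this is not taken from the Crux source. Proteins without a known length (for example decoys missing from the FASTA file) are simply left out here.

```python
def nsaf(spectral_counts, protein_lengths):
    """Return {protein_id: NSAF} for proteins whose length is known.

    spectral_counts: {protein_id: number of PSMs assigned to the protein}
    protein_lengths: {protein_id: sequence length in residues}
    """
    # SAF = spectral count / protein length. Proteins with no known
    # length are skipped instead of being reported as zero.
    saf = {
        pid: count / protein_lengths[pid]
        for pid, count in spectral_counts.items()
        if protein_lengths.get(pid, 0) > 0
    }
    total = sum(saf.values())
    if total == 0:
        return {pid: 0.0 for pid in saf}
    # NSAF = SAF normalized so that all values sum to 1.
    return {pid: value / total for pid, value in saf.items()}


# Hypothetical toy input: two targets with known lengths and one decoy
# that has no entry in the length table.
counts = {"P1": 10, "P2": 5, "decoy_P3": 2}
lengths = {"P1": 100, "P2": 50}
result = nsaf(counts, lengths)
```

With this toy input both targets have the same count-to-length ratio, so each gets an NSAF of 0.5, and the decoy does not appear in the result.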

If you need an example file, you can download 20150513_QEP2_FBU_AB_3D_set1_8h_03.raw from project PXD005713 on the PRIDE database. This is Drosophila data, found here:

Thanks for your help, and thanks for making this toolkit!

William S Noble

Oct 8, 2020, 5:47:47 PM
to esv...@gmail.com, crux-users
On Wed, Sep 30, 2020 at 12:09 PM esv...@gmail.com <esv...@gmail.com> wrote:

Hey Bill,

I've been experimenting with your spectral count features, the NSAF function specifically, and I've noticed a few issues that make it problematic for use.

Firstly, the quantification will fail if there are decoy PSMs listed under your FDR of choice in the .percolator.target.psms file.

Hmm, this is weird, since by definition the .percolator.target.psms file should only contain target PSMs. Isn't there a separate decoy.psms file?
 
Every returned quantity will be zero because the decoy_sp|XXXX will not be in the fasta file database. This might not be the case with a tide search, I'm not sure if the generated database there would be more useable. I, however, was using this through a comet search. So my original database file doesn't have the decoys. Not that they need to be quantified anyway, perhaps ignoring them when running spectral counts would be an ideal implementation?


Yes, that seems sensible.  In general, reasonable behavior might be to simply not report quantifications for proteins that cannot be found.
 
Secondly, when opting to quantify proteins, results yield a rather different list of proteins compared to what might be validated via fido. Looking at the equation you use to calculate this, it seems pretty simple to do. I'm planning on making my own script to run through my fido-validated proteins rather than using spectral-counts. Is there any reason you quantify proteins based on the percolator.target.psms file? I would think that quantifying peptides from the psm file based on the target.protein matches would be the ideal end result.


Yes, this is also a good idea. We would need to look into whether this is straightforward to implement. Have you verified in your case that spectral-counts does not work with the percolator.target.protein.txt file as input?

Bill





esv...@gmail.com

Oct 19, 2020, 3:24:27 PM
to crux-users
Hey Bill,

Sorry for the late response.

I do have a separate .decoy.psms file, although some decoys seem to make it through validation for me.

Running spectral-counts on the .target.proteins file does not work; it throws "FATAL: No peptide sequence ()". You can, however, run it on .target.peptides.

Also, one thing I forgot to include in the original post: if I run with --unique-mapping True, I get a "Segmentation fault (core dumped)" message. Only the False option finishes successfully.

William S Noble

Dec 21, 2020, 2:26:53 PM
to esv...@gmail.com, crux-users
Sorry that this issue has languished in my inbox for so long. I would like to see if I can track this down and get it sorted out. Would you be willing to provide an input file and a command line that illustrates the problem, so I can try to reproduce it on our end?

Thanks.
Bill

esv...@gmail.com

Jan 30, 2021, 2:54:41 PM
to crux-users
Hey Bill,

I specified a PRIDE project and file at the bottom of the first post (which file I used didn't seem to make a difference for me). You can grab the Drosophila database from UniProt.

Please excuse the lack of formatting, I don't see any way to add a code highlight in this reply box.

From the logs:
C:/stuff/crux-3.2.d330b2f.Windows.AMD64/crux-3.2.Windows.AMD64/bin/crux.exe comet --decoy_search 1 --peptide_mass_units 2 --auto_peptide_mass_tolerance warn --precursor_tolerance_type 1 --auto_fragment_bin_tol warn --clip_nterm_methionine 1 --allowed_missed_cleavage 2 --add_C_cysteine 71.037114 --variable_mod01 79.966331 STY 1 2 -1 0 0 --variable_mod02 15.994915 MY 1 2 -1 0 --overwrite T --output-dir C:/stuff/drosophila/PXD005713/crux-output --fileroot 20150513_QEP2_FBU_AB_3D_set1_8h_03 C:/stuff/drosophila/PXD005713/MGFs/20150513_QEP2_FBU_AB_3D_set1_8h_03.mgf C:/stuff/fastas/Fly_Drosophila_melanogaster.fasta

C:/stuff/crux-3.2.d330b2f.Windows.AMD64/crux-3.2.Windows.AMD64/bin/crux.exe percolator --protein T --fido-empirical-protein-q T --search-input concatenated --test-each-iteration T --overwrite T --klammer T --output-dir C:/stuff/drosophila/PXD005713/crux-output --fileroot 20150513_QEP2_FBU_AB_3D_set1_8h_03 C:/stuff/drosophila/PXD005713/crux-output/20150513_QEP2_FBU_AB_3D_set1_8h_03.comet.target.txt

I no longer have logs of how I ran spectral-counts, but this is the xonsh code I was using:

crux spectral-counts --parsimony simple --threshold 0.05 --unique-mapping F --quant-level protein --protein-database @(dbfile) --output-dir @(outdir) --fileroot @(nroot) --overwrite T @(spath)
Just plug in some paths where appropriate.

I ran the searches and validation on Windows, as is obvious from the above, but spectral-counts was run on Linux. I don't believe I tested whether that made a difference.

William S Noble

Feb 3, 2021, 12:56:58 AM
to esv...@gmail.com, crux-users
OK, I have replicated this problem and figured out the source. The mystery was why the percolator.target.psms.txt file contains references to decoy proteins.  This file is supposed to contain only the targets.  The answer is that some peptide sequences occur in both.  E.g., the sequence ISILK appears in these proteins:

sp|Q9VIM5|ARPC2_DROME
tr|Q8IPA8|Q8IPA8_DROME
tr|X2JAI2|X2JAI2_DROME
tr|Q8IPA5|Q8IPA5_DROME
decoy_tr|Q8IPA5|Q8IPA5_DROME
decoy_tr|Q8IPA8|Q8IPA8_DROME
decoy_tr|X2JAI2|X2JAI2_DROME
decoy_tr|Q8IPA5|Q8IPA5_DROME

It is a bit strange that comet reports peptides that are both targets and decoys, but I suppose this is unavoidable when you use reversal to generate decoys.  It's also strange that Percolator reports these decoy IDs in the target file.  I will raise this as an issue with the Percolator authors.  However, it seems clear that spectral-counts should be smarter about this.  The error message it raises is
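
One way such overlaps can arise is sketched below, under the loud assumption that decoys are made by reversing whole protein sequences and that digestion is a simplified tryptic cleavage after K or R (no proline rule, no missed cleavages). Comet's actual decoy scheme may differ from this, and the sequence is a made-up toy, not a Drosophila protein. The helper names are hypothetical.

```python
import re


def tryptic_peptides(sequence):
    """Split a sequence after every K or R (simplified trypsin rule)."""
    # Each match is either a run ending in K/R or the C-terminal remainder.
    return set(re.findall(r"[^KR]*[KR]|[^KR]+$", sequence))


def shared_target_decoy_peptides(target_sequence):
    """Return peptides produced by both the target and its reversal."""
    decoy_sequence = target_sequence[::-1]  # assumed decoy: full reversal
    return tryptic_peptides(target_sequence) & tryptic_peptides(decoy_sequence)


# Toy sequence: the palindromic stretch LEVEL is flanked by K on both
# sides, so reversing the protein reproduces the peptide LEVELK.
toy_target = "MAGKLEVELKSS"
shared = shared_target_decoy_peptides(toy_target)
```

For this toy sequence the target digests to MAGK, LEVELK, and SS, the reversed sequence digests to SSK, LEVELK, and GAM, and `shared` contains only LEVELK. Short peptides make this kind of coincidence more likely in a real database.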

WARNING: Need protein sequence in order to calculate protein length.
   Please provide protein fasta or index using the protein-database parameter
   Protein decoy_tr|X2JLN4|X2JLN4_DROME doesn't have the full sequence

The program should notice that, in fact, a protein database was provided, and it should simply skip over proteins that do not appear in the given database.  We'll work on fixing this.
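
Until that lands, a user-side workaround could look roughly like the following sketch: collect the protein IDs present in the FASTA file and drop any ID from a PSM's protein list that is not among them. This is only an illustration, not the planned Crux fix. It assumes that the protein IDs for a PSM are stored as one comma-separated string and that a FASTA ID is the first whitespace-delimited token of the header; both are assumptions, and the function names and placeholder sequences are hypothetical.

```python
def read_fasta_ids(lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def split_known_proteins(protein_field, known_ids, sep=","):
    """Split a protein-ID field into IDs found in the database and the rest."""
    ids = [p.strip() for p in protein_field.split(sep) if p.strip()]
    kept = [p for p in ids if p in known_ids]
    skipped = [p for p in ids if p not in known_ids]
    return kept, skipped


# Headers use IDs quoted in this thread; the sequences are placeholders.
fasta_lines = [
    ">sp|Q9VIM5|ARPC2_DROME",
    "PLACEHOLDERSEQ",
    ">tr|Q8IPA8|Q8IPA8_DROME",
    "PLACEHOLDERSEQ",
]
known = read_fasta_ids(fasta_lines)
field = "sp|Q9VIM5|ARPC2_DROME,tr|Q8IPA8|Q8IPA8_DROME,decoy_tr|Q8IPA8|Q8IPA8_DROME"
kept, skipped = split_known_proteins(field, known)
```

Here `kept` holds the two target IDs and `skipped` holds the decoy ID, which could be logged once as a warning rather than causing every quantity to come back as zero.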

Bill



esv...@gmail.com

Feb 4, 2021, 8:26:06 PM
to crux-users
Thanks for your work, Bill!

Also, I didn't realize comet was using reversed decoys. You should probably add that to the documentation.