Replication of plots from exact p-value paper


Jordan Force

Feb 20, 2020, 9:27:17 AM
to crux-users
Hi everyone,

This question is mainly for Bill Noble and Jeffry Howbert, but it seemed reasonable to ask it on the mailing list. I'm trying to replicate Figure 4B from the paper on computing p-values with XCorr (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4159662/pdf/zjw2467.pdf). There, the authors plot p-value against p-value rank for 10,000 random decoy PSMs from the yeast dataset, and find that the p-values are not uniformly distributed when identical abundance estimates are used at all peptide positions. My results are quite different: the p-values I get are fairly uniform. I can go into more detail about exactly how I tried to replicate the figure, but this is my strategy:

  1. Use the generate-peptides utility to digest the yeast proteome with the trypsin/p setting and "--keep-terminal-aminos none", so that the terminal positions get shuffled when creating the decoys.
  2. Convert the generate-peptides.decoy.txt file to FASTA, and call tide-index to create an index from it. I use "--decoy-format none", since we don't need to make a second set of decoys here. Since I don't need to digest again, I use "--custom-enzyme '[Z]|[Z]' --enzyme custom-enzyme".
  3. Search the yeast MGF file using "--exact-p-value T --top-match 10000000", so that all matches are reported for each scan.
  4. Use a script I wrote to pick a random match for each spectrum in the output generated above, and plot the log p-value vs. the log p-value rank.
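
Step 4 could be sketched roughly as follows. This is not the script used for the plot in this thread, only an illustration. It assumes the search output is a tab-delimited file with a header row containing columns named "scan" and "exact p-value"; those column names and all function names here are assumptions and should be checked against the actual output file.

```python
import csv
import math
import random
from collections import defaultdict


def read_psms(path):
    """Read a tab-delimited search result file into a list of dict rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def pick_one_match_per_scan(rows, scan_col="scan", pval_col="exact p-value", seed=0):
    """Group PSM rows by scan and keep one randomly chosen p-value per scan.

    The column names are assumptions about the output format.
    """
    by_scan = defaultdict(list)
    for row in rows:
        by_scan[row[scan_col]].append(float(row[pval_col]))
    rng = random.Random(seed)
    # Sort the scan keys so the result is reproducible for a given seed.
    return [rng.choice(by_scan[scan]) for scan in sorted(by_scan)]


def log_rank_points(pvalues):
    """Return (log10 normalized rank, log10 p-value) pairs, smallest p-value first."""
    ordered = sorted(p for p in pvalues if p > 0)  # log10 is undefined at 0
    n = len(ordered)
    return [(math.log10((i + 1) / n), math.log10(p)) for i, p in enumerate(ordered)]
```

The pairs returned by log_rank_points can be passed to any plotting library. If the p-values are uniform, the points fall close to the diagonal y = x on these axes.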

This is the plot I get:

[Attached image: pvals.png]


This is much closer to being uniform than figure 4B in the paper. What am I doing differently? I recognize that this paper is a few years old, so some of the details may have been lost to time. However, any advice you could provide would be extremely helpful. 


By the way, for the second step, the tide-index documentation says to use '{X}|{X}' to prevent digestion. I tried this, but I get "FATAL: No target sequences generated. Is '...' a FASTA file?". Using '[Z]|[Z]' doesn't cause this error, and it does prevent digestion (based on the tide-index.peptides.target.txt file created by tide-index). I'm using Crux 3.1.
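
The FASTA conversion in step 2 could be sketched as below. The layout of generate-peptides.decoy.txt is not described in this thread, so the sketch takes an already-extracted list of peptide sequences; the function name and the "decoy_" ID prefix are made up for illustration.

```python
def peptides_to_fasta(peptides, prefix="decoy_"):
    """Format each peptide as its own FASTA record with a sequential, made-up ID."""
    records = []
    for sequence in peptides:
        sequence = sequence.strip().upper()
        if not sequence:
            continue  # skip blank entries
        record_id = f"{prefix}{len(records) + 1}"
        records.append(f">{record_id}\n{sequence}\n")
    return "".join(records)
```

Writing one record per peptide means the index only has to keep each sequence whole, which is what the no-digestion enzyme setting is meant to achieve.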


Thanks,


Jordan

William S Noble

Feb 21, 2020, 4:04:13 PM
to Jordan Force, crux-users
Hi Jordan,
I don't think it's possible to reproduce the behavior shown in Figure 4 without modifying the source code of Crux. The difference between the panels has to do with how the background probabilities associated with individual amino acids are calculated. The current version of Crux uses separate backgrounds for N-terminal, C-terminal, and all other positions, and doesn't give the user a way to change that behavior.

Incidentally, I don't think it's necessary to do the first half of step 4 (randomly picking one p-value per spectrum). All of the p-values should be randomly distributed, so you should get a roughly uniform distribution with or without that step.
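
A small simulation can illustrate this point. It is an idealized sketch, not Crux output: it assumes every null p-value is an independent draw from Uniform(0, 1), and the numbers of spectra and matches are arbitrary. The helper name is hypothetical.

```python
import random


def ks_distance_from_uniform(pvalues):
    """Largest gap between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


rng = random.Random(42)
# Simulated null: 2,000 hypothetical spectra, each with 5 candidate matches whose
# p-values are uniform on [0, 1). Real search output is not generated this way.
per_spectrum = [[rng.random() for _ in range(5)] for _ in range(2000)]

all_pvalues = [p for matches in per_spectrum for p in matches]
one_per_spectrum = [rng.choice(matches) for matches in per_spectrum]

print("KS distance, all matches:     ", ks_distance_from_uniform(all_pvalues))
print("KS distance, one per spectrum:", ks_distance_from_uniform(one_per_spectrum))
```

Both distances should come out small. Under this idealized null, subsampling changes only the number of points, not the shape of the distribution.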

Bill


--
You received this message because you are subscribed to the Google Groups "crux-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to crux-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crux-users/f273800e-6941-4c51-af95-e8e4013d4f6f%40googlegroups.com.

Andy

Feb 21, 2020, 4:39:19 PM
to crux-users
Hi Jordan,

Bill asked me to look into your statement about tide-index and '{X}|{X}'. Can you post or send a link to the command and FASTA file you used?

Thanks,
Andy

Jordan Force

Feb 25, 2020, 2:14:23 PM
to crux-users
Sure, Andy. 

This is the command I used:

/home/jordan/crux tide-index ../peptides/crux-output/decoys.fasta Decoys --decoy-format none --custom-enzyme "{X}|{X}" --enzyme custom-enzyme --peptide-list T --overwrite T

I've tried both double and single quotes around the custom-enzyme argument, and I've tried this in both zsh and bash.



Thanks,

Jordan

Jordan Force

Feb 25, 2020, 2:47:49 PM
to crux-users
I think I may see the crux of the confusion here. I know that for Figure 3, you didn't include the terminal positions in the shuffling when generating the decoys. This makes sense, because the null hypothesis is that the non-terminal positions of the peptides are drawn according to the background amino acid distribution in the proteome, while the terminal positions are drawn according to other amino acid distributions (based partially on the cleavage preferences of trypsin).

However, for Figure 4(B), I took your null hypothesis to be that the peptide is drawn according to the same background amino acid distribution at all positions. I therefore assumed that the decoys used to create plot 4(B) were generated by including the terminal positions in the shuffling, in which case Crux's use of different backgrounds for the N-term and C-term positions shouldn't make a big difference. Am I misunderstanding how the decoys for Figure 4(B) were generated in the paper?
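
To make the two null models concrete, here is a minimal sketch that estimates amino acid frequencies either separately by position (N-terminal, C-terminal, other) or pooled across all positions. It only illustrates the distinction being discussed and is not how Crux computes its backgrounds; the function names and the toy peptides are made up.

```python
from collections import Counter


def positional_frequencies(peptides):
    """Estimate amino acid frequencies separately for N-terminal, C-terminal, and interior positions."""
    counts = {"n_term": Counter(), "c_term": Counter(), "other": Counter()}
    for pep in peptides:
        if len(pep) < 2:
            continue  # too short to have distinct termini
        counts["n_term"][pep[0]] += 1
        counts["c_term"][pep[-1]] += 1
        counts["other"].update(pep[1:-1])
    freqs = {}
    for position, counter in counts.items():
        total = sum(counter.values())
        freqs[position] = {aa: c / total for aa, c in counter.items()} if total else {}
    return freqs


def pooled_frequencies(peptides):
    """Estimate a single amino acid frequency table that ignores position."""
    counter = Counter("".join(peptides))
    total = sum(counter.values())
    return {aa: c / total for aa, c in counter.items()} if total else {}


toy_peptides = ["AGLK", "MEER", "SLDK"]  # made-up tryptic-looking peptides
by_position = positional_frequencies(toy_peptides)
pooled = pooled_frequencies(toy_peptides)
print("C-terminal K frequency:", by_position["c_term"].get("K", 0.0))
print("Pooled K frequency:    ", pooled.get("K", 0.0))
```

For tryptic peptides, K and R are far more common at the C-terminus than in the pooled table. When the terminal residues are shuffled into the interior, the decoys no longer have that enrichment, which is the situation the question above is asking about.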



Andy

Feb 25, 2020, 6:52:01 PM
to crux-users
Jordan,

I am unable to replicate the error that you get. In fact, I get the exact opposite: when I use --custom-enzyme "{X}|{X}", tide-index runs to completion, and when I use --custom-enzyme "{Z}|{Z}", I get the error that you describe. One difference is that I am using Crux 3.2 instead of Crux 3.1. Is there any reason why you do not wish to use the newest version of Crux (3.2)? If not, I would suggest upgrading and seeing whether the problem persists.

Andy

Jordan Force

Feb 28, 2020, 1:58:43 PM
to crux-users
I was using Crux 3.1 simply because it's what I had installed. I just installed Crux 3.2 from here: https://noble.gs.washington.edu/crux-downloads/crux-3.2/crux-3.2.Linux.x86_64.zip, tried tide-index with "{X}|{X}" for custom-enzyme, and I still get the same error. According to the version utility, the exact Crux version is 3.2-0d57cff, and this is the exact command I used:

./crux tide-index /data1/jordan/tide_p_value/yeast_all_search/index/crux-output/decoys.fasta Decoys --decoy-format none --custom-enzyme "{X}|{X}" --enzyme custom-enzyme --overwrite T

Using "[Z]|[Z]" (not {Z}|{Z}) works though.

Andy

Mar 3, 2020, 4:02:40 PM
to crux-users
Hi Jordan,

OK, I think we figured it out. I am using the latest daily build of Crux, whereas you are using the release version of Crux 3.2. When you download Crux from the website (http://crux.ms/download.html), you want 'most recent build', not the release version. Starting from your link, go into the daily folder and find the latest version.

Let me know if that works.

Thanks,
Andy

Jordan Force

Mar 5, 2020, 2:30:45 PM
to crux-users
Yup, it works with the latest version. 