FDR small data set size


Bassel

Mar 11, 2019, 4:59:45 AM
to PeptideShaker
Hello,

I am using SearchGUI and PeptideShaker to identify proteins in a small sample (ca. 20 proteins).
I would like to know how to estimate the FDR for this kind of sample. With an FDR of 1%, none of the proteins in the sample are validated, including the ones I am sure are in the sample (I spiked them in).
Can I simply apply a higher value when the expected number of proteins in the sample is low?

Thanks,
Bassel 

lnnrt

Mar 25, 2019, 2:01:27 PM
to PeptideShaker
Hi Bassel,


With very small database sizes, you essentially give up the ability to assess the FDR.

This happens because there are simply too few tryptic peptide sequences that can be derived from the database, which also means too few decoy peptides.

And because there are too few decoy peptides, it becomes essentially impossible to assess reliably what the scores of bad hits look like. Think of it as trying to assess the average height of a population when you can only measure the height of twelve people; the probability that you get an unrepresentative sample is just too high.
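To put rough numbers on this, here is a minimal Python sketch. It assumes the simple decoys-over-targets FDR estimate and plain Poisson counting noise on the decoy count; the function names are made up for illustration and nothing here is taken from PeptideShaker's own code.

```python
import math


def decoy_fdr(n_target, n_decoy):
    """Simple target-decoy estimate: decoy hits / target hits above a score threshold."""
    if n_target == 0:
        raise ValueError("no target hits above the threshold")
    return n_decoy / n_target


def relative_uncertainty(n_decoy):
    """Approximate relative standard error of a count, assuming Poisson noise (1/sqrt(n))."""
    if n_decoy == 0:
        return math.inf
    return 1.0 / math.sqrt(n_decoy)


# Same estimated FDR, very different reliability (illustrative counts, not real data).
for n_target, n_decoy in [(200, 2), (20000, 200)]:
    fdr = decoy_fdr(n_target, n_decoy)
    rel = relative_uncertainty(n_decoy)
    print(f"targets={n_target} decoys={n_decoy} FDR~{fdr:.3f} relative error~{rel:.0%}")
```

Both cases report the same FDR, but the estimate built on two decoy hits carries a far larger relative error than the one built on two hundred.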

When dealing with such small samples, the best you can do is to ignore the FDR (as it will not be reliable anyway), and instead focus on the actual peptide-to-spectrum matches. These you can evaluate yourself, if you have some experience, but keep in mind that you cannot claim any statistical significance.

Another approach that could work is to add the 20 proteins of interest to a much larger, unrelated database and then search that database. In doing so, you create a large background for your spectra to match against, which restores the statistical power needed to estimate the FDR.

Implicitly, you are then doing an entrapment search, in which you have decoys (reversed versions of the background database and of your 20 proteins) and entrapment sequences (the background database). You could therefore even use the number of background sequence identifications as an estimate of how reliable your hits against your 20 proteins are. Briefly put, you don't want to see too many of these background hits.
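As a rough sketch of that last idea, the following Python snippet computes the fraction of validated hits that land on the background. It assumes you have exported the accessions of your validated hits into a plain list; the function name and the accessions are hypothetical.

```python
def entrapment_fraction(hits, proteins_of_interest):
    """Fraction of validated hits that map to the background (entrapment) sequences."""
    hits = list(hits)
    if not hits:
        raise ValueError("no validated hits to evaluate")
    background_hits = [h for h in hits if h not in proteins_of_interest]
    return len(background_hits) / len(hits)


# Hypothetical accessions: three spiked-in proteins and one background hit.
spiked = {"SPIKE_01", "SPIKE_02", "SPIKE_03"}
validated = ["SPIKE_01", "SPIKE_02", "BACKGROUND_0042", "SPIKE_03"]
print(entrapment_fraction(validated, spiked))
```

A high fraction suggests that the hits against your 20 proteins deserve less trust as well. Note that this raw fraction does not correct for the size difference between the background and your proteins of interest.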

Do note that such a background database should be sufficiently large (over 5,000 proteins, preferably more like 10,000) and that it should be quite different in sequence from your 20 proteins (otherwise, peptides can be shared between the two).
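One way to check for such overlap is to digest both sets in silico and look for shared peptides. The sketch below is a simplification: it assumes fully tryptic cleavage (after K or R, not before P), no missed cleavages, and a minimum peptide length of 6, and the sequences are invented for the example.

```python
import re


def tryptic_peptides(sequence, min_len=6):
    """Fully tryptic peptides: cleave after K/R unless followed by P, no missed cleavages."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}


def shared_peptides(targets, background, min_len=6):
    """Peptides found in both the proteins of interest and the background database."""
    target_peps = set()
    for seq in targets.values():
        target_peps |= tryptic_peptides(seq, min_len)
    background_peps = set()
    for seq in background.values():
        background_peps |= tryptic_peptides(seq, min_len)
    return target_peps & background_peps


# Invented sequences: one peptide is deliberately shared.
targets = {"SPIKE_01": "MAAAAAAKGGGGGGR"}
background = {"BACKGROUND_0001": "GGGGGGRCCCCCCK"}
print(shared_peptides(targets, background))
```

If the returned set is empty or very small, the background is a reasonable candidate by this crude measure.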

Having said all of this, you might as well just search against the normal proteome of the species you're investigating (or in which you're expressing proteins, if these are recombinants). This will give you similar results, and may inform you of the possible presence of other proteins (due to incomplete purification or similar).


Hope this helps!

Cheers,

lnnrt.

On Monday, March 11, 2019 at 09:59:45 UTC+1, Bassel wrote:

Bassel

Apr 2, 2019, 4:03:27 AM
to PeptideShaker
Thanks a lot, lnnrt, for your detailed answer.
Very helpful!

Best,
Bassel