Abundance estimates - equal noise for all expression bins.

N Vi

Jun 10, 2016, 2:53:58 PM

to Sailfish Users Group

Hi,

The paper "Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data" shows that "High expression levels are more accurately estimated than low expression levels". The estimates I get with the --bootstrap option likewise show higher variance for lowly expressed genes.

Could I get equal noise/variance in the abundance estimates for all genes by having the same number of reads/kb for every gene? How would the distribution of these reads across the different equivalence classes affect the estimates?

Is it possible to get the sequence coordinates of the equivalence classes, so that the expression level can be digitally normalized across all genes?

Thanks,
N Vi

Rob

Jun 15, 2016, 6:09:04 PM

to Sailfish Users Group

Hi N Vi,

I'm not quite certain I understand your question. It is true that variance tends to be higher at low expression levels; that is, low-abundance transcripts are more difficult to quantify accurately than highly abundant transcripts.

If all genes have the same sampling rate (i.e. the same number of reads / kb), then they will all have the same abundance in any unit that measures sampling rate (like TPM, FPKM, etc.). Note that you refer here to genes; a uniform sampling rate at the gene level doesn't necessarily imply that all of the constituent transcripts will have the same rate, since that depends on coverage etc. Such an equal sampling rate doesn't seem to happen biologically, which is the nature of the transcript abundance estimation problem we face in reality.
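To make the point about sampling rate concrete, here is a minimal sketch. It assumes TPM is computed from raw feature length rather than effective length (a simplification), and the function name `tpm_from_counts` and the toy numbers are purely illustrative, not anything Sailfish exposes.

```python
def tpm_from_counts(counts, lengths_bp):
    """Convert read counts to TPM using raw lengths (simplifying assumption)."""
    # Sampling rate: reads per kilobase for each feature.
    rates = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    # TPM rescales the rates so that they sum to one million.
    return [1e6 * r / total for r in rates]

# Four hypothetical genes of different lengths, each sampled at 50 reads/kb.
lengths = [1000, 2000, 500, 4000]
counts = [50, 100, 25, 200]
tpm = tpm_from_counts(counts, lengths)
print(tpm)
```

Because every gene has the same reads/kb, every gene ends up with the same TPM, regardless of its length or raw read count.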

Each equivalence class may correspond to many different coordinates. This is because fragments are placed in an equivalence class if they map to an identical set of transcripts, and this can happen at multiple different coordinates among these transcripts. For example, imagine a gene with 3 transcripts t1, t2, t3 derived from a set of exons {e1, e2, e3, e4} such that t1 = {e1, e3, e4}, t2 = {e1, e3}, t3 = {e1, e2}. Any fragment mapping to exon 1 will go into an equivalence class labeled by transcripts t1, t2 and t3. Any fragment mapping to the e1-e3 boundary will go into an equivalence class labeled by t1 and t2, as will any fragment mapping entirely within e3. Any fragment mapping to the e1-e2 boundary or entirely within e2 will go into an equivalence class labeled by t3. There are some other cases as well (for example, a fragment crossing the e3-e4 boundary or lying entirely within e4 is labeled by t1 alone). But here, you can see that both fragments crossing the e1-e3 boundary and those mapping (anywhere) within e3 will go into the equivalence class labeled with t1 and t2. Thus, this equivalence class doesn't represent a single locus and may, in fact, represent many distinct loci.
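The toy gene above can be sketched in a few lines. This is only an illustration of the grouping logic, not how Sailfish maps reads: it assumes each fragment lies within one exon or spans exactly one junction between consecutive exons, and the helper names (`features`, `equivalence_classes`) are made up for the example.

```python
from collections import defaultdict

# The three transcripts from the example, as ordered exon lists.
transcripts = {
    "t1": ["e1", "e3", "e4"],
    "t2": ["e1", "e3"],
    "t3": ["e1", "e2"],
}

def features(exons):
    """Loci a fragment could come from: exon bodies and exon-exon junctions."""
    junctions = [f"{a}-{b}" for a, b in zip(exons, exons[1:])]
    return list(exons) + junctions

def equivalence_classes(transcripts):
    # For each locus, collect the set of transcripts that contain it.
    compatible = defaultdict(set)
    for name, exons in transcripts.items():
        for locus in features(exons):
            compatible[locus].add(name)
    # Group loci by their transcript set; that set is the class label.
    classes = defaultdict(list)
    for locus, names in compatible.items():
        classes[tuple(sorted(names))].append(locus)
    return {label: sorted(loci) for label, loci in classes.items()}

classes = equivalence_classes(transcripts)
for label, loci in sorted(classes.items()):
    print(label, loci)
```

The class labeled (t1, t2) collects both the e1-e3 junction and the body of e3, which is exactly why a single equivalence class cannot be reduced to one coordinate range.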

Perhaps I've misinterpreted your question. If so, please let me know.

Best,

Rob

N Vi

Jun 15, 2016, 7:21:38 PM

to Sailfish Users Group

Hi Rob,

Thank you for finding time to answer my question. Apologies for not being clear.

1) In this case, I am not interested in expression at the gene level. The only thing I want to estimate is the relative abundance of isoforms within a gene. These relative isoform abundances are better estimated for highly expressed genes than for lowly expressed genes. I want to estimate the relative isoform abundances in all genes with the same level of accuracy/power, even if this means reducing the accuracy for highly expressed genes, so that the level of noise is the same for all genes. In the end I might choose a set that consists of the majority of genes.

2) Thank you for explaining the concept of an equivalence class. I am familiar with your --dumpEq option and understand that a class can cover multiple distinct loci. I would like a BED record of all the loci that make up each equivalence class. My hope is that I can sample reads so that I have the same number of reads/kb in each gene, while making sure that the ratio of sampled reads across equivalence classes matches the ratio in the original equivalence classes.
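A rough sketch of the sampling idea in 2), just to pin down what is meant. It is not an existing Sailfish feature. It assumes every equivalence class can be assigned to a single gene (classes spanning several genes are ignored), uses raw gene length, and thins every class of a gene with one shared keep-probability, so the class ratios are preserved only in expectation. The function name `thin_gene` and all counts are hypothetical.

```python
import random

def thin_gene(class_counts, gene_length_bp, target_reads_per_kb, rng):
    """Binomially thin each equivalence class of one gene by a shared keep-probability."""
    total = sum(class_counts.values())
    target_total = target_reads_per_kb * gene_length_bp / 1000.0
    # Genes already at or below the target depth are left untouched (keep = 1).
    keep = min(1.0, target_total / total) if total > 0 else 0.0
    thinned = {
        label: sum(1 for _ in range(n) if rng.random() < keep)
        for label, n in class_counts.items()
    }
    return keep, thinned

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical per-class read counts for two 2 kb genes.
high = {("t1", "t2"): 6000, ("t1",): 3000, ("t2",): 1000}  # 5000 reads/kb
low = {("t3", "t4"): 30, ("t3",): 10}                      # 20 reads/kb

keep_high, thinned_high = thin_gene(high, 2000, 50, rng)
keep_low, thinned_low = thin_gene(low, 2000, 50, rng)
print(keep_high, thinned_high)
print(keep_low, thinned_low)
```

Note that downsampling can only lower depth: a gene below the target rate keeps all of its reads and stays noisier than the rest, which is why only a subset of genes would end up at equal depth.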

I hope this clears things up a bit. Please let me know if anything needs more detail.