intersections, significance


Davide Cittaro

Mar 18, 2011, 8:52:29 AM
to bedtools...@googlegroups.com
Hi all, 
When using intersectBed to check overlaps between two sets of features, I often have to answer questions like "is the overlap between features A and features B somehow significant?". 
In order to do this I usually shuffleBed features A (or B) many times (e.g. 10,000) and count the number of intersections (or their extent) in order to estimate the null distribution and say something about the original overlap. Others say it would be easier to use a binomial distribution to give a p-value for the overlaps... 
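For reference, a minimal self-contained sketch of that shuffle-and-count procedure in pure Python (in practice one would drive shuffleBed/intersectBed instead; the toy genome and the simplified interval handling here are hypothetical):

    import random

    # Toy genome: chromosome name -> length (hypothetical sizes).
    GENOME = {"chr1": 1_000_000, "chr2": 800_000}

    def count_overlaps(a, b):
        """Number of intervals in a overlapping at least one interval in b.
        Intervals are (chrom, start, end) tuples, 0-based half-open."""
        n = 0
        for chrom_a, start_a, end_a in a:
            if any(chrom_a == chrom_b and start_a < end_b and start_b < end_a
                   for chrom_b, start_b, end_b in b):
                n += 1
        return n

    def shuffle_features(features, genome, rng):
        """Place each feature at a random position on a random chromosome,
        preserving its length (roughly what shuffleBed does by default)."""
        shuffled = []
        for _, start, end in features:
            length = end - start
            chrom = rng.choice(list(genome))
            new_start = rng.randrange(0, max(1, genome[chrom] - length))
            shuffled.append((chrom, new_start, new_start + length))
        return shuffled

    def null_overlap_counts(a, b, genome, n_trials=10_000, seed=0):
        """Overlap counts from n_trials random placements of the A features."""
        rng = random.Random(seed)
        return [count_overlaps(shuffle_features(a, genome, rng), b)
                for _ in range(n_trials)]

The list returned by null_overlap_counts is the estimated null distribution against which the observed overlap can be compared.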
I would like to ask a couple of things:
1- What does the BEDTools user community think about the significance of overlaps? How would you estimate it?
2- Once there is general agreement on this, wouldn't it be nice to have an additional option for, say, intersectBed that outputs this value?

d

/*
Davide Cittaro, PhD

Cogentech - Consortium for Genomic Technologies
via adamello, 16
20139 Milano
Italy

*/




Aaron Quinlan

Mar 18, 2011, 4:40:23 PM
to bedtools...@googlegroups.com, Ryan Layer
Hi Davide,

Significance testing is an area that I am very interested in.  In fact, I've been working with a very talented CS grad student (Ryan Layer, copied) on this exact topic.  Like shuffleBed, the approaches Ryan is developing are based on Monte Carlo simulations, where one randomly distributes the "B" features many thousands of times and counts the number of times these shuffles recapitulate the same level of overlap as your observation. Ryan is on the list, so he may chime in and elaborate on his methods, which involve using high-performance computing architectures for rapid and broad-scale significance screens. 

The idea of incorporating this flavor of significance testing into intersectBed is interesting, though I'd rather expand shuffleBed to generate such P-values.  

I am a bit confused about the binomial model.  Is the idea that there is a certain probability for overlap (heads) and non-overlap (tails) and that the significance is derived from a binomial based on this?  If so, this is interesting, as one important downside of Monte Carlo trials is that they do not respect the relative distances of B features in the genome.  In some cases, these distances/densities may be biologically relevant.

Could you elaborate a bit on the proposed binomial model?  Perhaps a toy example?

Thanks for the timely note.

Best,
Aaron

Anshul Kundaje

Mar 18, 2011, 5:08:41 PM
to bedtools...@googlegroups.com, Aaron Quinlan, Ryan Layer
This is an interesting discussion. Just wanted to chime in.

The null-model assumption of randomly shuffling elements does not account for the 'structural' correlations between elements in the genome; e.g., binding sites of a particular TF might tend to have a higher likelihood of occurring in promoter regions than elsewhere.

In the ENCODE project, we have been faced with similar questions about overlap statistics, and Peter Bickel's group at Berkeley (Statistics) has a very elegant solution for creating a more realistic null model, thus giving conservative z-scores and p-values. It's called the genome structure correction. It has been used in the ENCODE pilot paper as well as in the following papers.

- The main (rather mathematical) statistics paper is here: http://www.stat.berkeley.edu/~bickel/Bickel%20et%20al%202010%20AAS.pdf
- For its application to binding-site overlap problems, you can also check out the paper "Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions"


You can also look up an informal blog post from Ewan Birney at the EBI http://ensembl.blogspot.com/2008_04_01_archive.html

The ENCODE group (specifically the Bickel group) plans to release usable software in a few months, which could potentially be integrated with BEDTools (which is truly awesome) or used alongside it.

Thanks,
Anshul.

Aaron Quinlan

Mar 21, 2011, 8:17:10 PM
to Anshul Kundaje, bedtools...@googlegroups.com, Ryan Layer
Hi Anshul,

Thank you very much for this.  I must admit to being utterly unaware of Peter Bickel's work.  I also must admit that I don't have my head entirely wrapped around the null model, but I hope to make my way through the papers you sent.  I am certainly open to working to incorporate new models for statistical testing, especially given Monte Carlo's ignorance of the spatial relationships that features have in the genome.  If you know of any (preferably well-documented) implementations of this approach, please let me know.

Have you thought about the binomial model that Davide brought forth?

Best,
Aaron


dawe

Mar 22, 2011, 8:45:27 AM
to bedtools-discuss

> I am a bit confused about the binomial model.  Is the idea that there is a certain probability for overlap (heads) and non-overlap (tails) and that the significance is derived from a binomial based on this?  If so, this is interesting, as one important downside of Monte Carlo trials is that they do not respect the relative distances of B features in the genome.  In some cases, these distances/densities may be biologically relevant.
>
> Could you elaborate a bit on the proposed binomial model?  Perhaps a toy example?

Mmm... I'm not sure it is a proper model. Anyway, a colleague of mine suggested that if you want to estimate the p-value of k features of A overlapping features of B, one can take
p = the coverage of B (i.e., the fraction of the genome it covers)
n = the number of features in B
and then build the binomial distribution B(n, p) and compute the probability of having more than k overlapping features (by looking at the CDF?).
Besides the fact that I'm not sure this is "theoretically" correct, I also suspect that this model does not properly account for feature size (relative to the genome, or between the two feature sets).
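For concreteness, a minimal sketch of how such a binomial p-value could be computed with SciPy, assuming the trials are taken to be the query (A) features and p the fraction of the genome covered by B; the function name and the numbers in the example are made up:

    from scipy.stats import binom

    def binomial_overlap_pvalue(k, n_features, covered_bp, genome_bp):
        """P(X >= k) for X ~ Binomial(n, p): each of the n query features is
        treated as an independent trial that "hits" the other set with
        probability p, approximated by the fraction of the genome covered
        by that other set."""
        p = covered_bp / genome_bp
        # binom.sf(k - 1, n, p) == P(X >= k)
        return binom.sf(k - 1, n_features, p)

    # Example (all numbers made up): 300 A features, 40 of which overlap B,
    # with B covering 5 Mb of a 100 Mb genome.
    print(binomial_overlap_pvalue(40, 300, 5_000_000, 100_000_000))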

I'm using the bootstrap method, although I'm not sure whether the set of bootstrapped intersections is distributed normally, as a Poisson, or otherwise (it looks normal, though).
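As an aside, one way to sidestep the question of what distribution the shuffled counts follow is to use their empirical rank directly rather than fitting a normal or a Poisson; a small illustrative helper (names hypothetical), which pairs with the null_overlap_counts sketch earlier in the thread:

    def empirical_pvalue(observed, null_counts):
        """Empirical p-value from a set of shuffled overlap counts: the
        fraction of null values at least as large as the observed one.
        No assumption about the shape of the null distribution (normal,
        Poisson, or otherwise) is needed."""
        hits = sum(1 for c in null_counts if c >= observed)
        # Add-one correction: with N shuffles the smallest reportable
        # p-value is 1/(N + 1), never exactly zero.
        return (hits + 1) / (len(null_counts) + 1)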
d