From the paper, we have the following description of how the Fisher's exact test is constructed:
GIGGLE eliminates this complexity by estimating the significance and enrichment between the query intervals and each indexed interval file with a Fisher's Exact two-tailed test and the odds ratio of a 2 × 2 contingency table containing the number of intervals that are in (i) both the query and indexed file, (ii) solely the query file, (iii) solely the indexed file, and (iv) neither the query file nor the indexed file. The first three values are directly computed with a GIGGLE search, and the last value is estimated by the difference between the union of the two sets and the quotient of the mean interval size of both sets and the genome size.
I would like to be able to recreate the Fisher contingency table for a given pair of query and index BED files, but it is unclear to me how to actually calculate (i)-(iv) above.
- For (i), is this the number of intervals in query that overlap index, the number of intervals in index that overlap query, or the sum of both? I'm not sure how to handle a case where one interval in query contains 2 or more intervals in index; would this count as 1, 2, or 3 "intervals in both the query and indexed file"?
- For (ii) and (iii), this seems fairly easy: the intervals in query that do not overlap index, and the intervals in index that do not overlap query. Basically the output of bedtools intersect -v in each direction; please correct me if I'm wrong.
- For (iv), GIGGLE computes this as (genome_size / mean_interval_size) - ("union of the two sets"). Again, the union is unclear when intervals can partially overlap; would this be the output of e.g. bedtools merge?
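To make the ambiguity concrete, here is a minimal Python sketch of the candidate readings of (i)-(iv). It is not GIGGLE's actual implementation; the function names (`overlaps`, `candidate_counts`, `neither_estimate`) are hypothetical, intervals are assumed to be half-open BED-style (chrom, start, end) tuples, and the mean interval size is assumed to be pooled over both files.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def candidate_counts(query, index):
    """Return the different readings of (i)-(iii) for two interval lists.

    Brute-force O(n*m) comparison; fine for illustration, not for real BED files.
    """
    pairs = sum(1 for q in query for x in index if overlaps(q, x))
    q_hit = sum(1 for q in query if any(overlaps(q, x) for x in index))
    x_hit = sum(1 for x in index if any(overlaps(q, x) for q in query))
    return {
        "overlapping_pairs": pairs,        # one reading of (i)
        "query_with_hit": q_hit,           # another reading of (i)
        "index_with_hit": x_hit,           # yet another reading of (i)
        "query_only": len(query) - q_hit,  # (ii), like bedtools intersect -v
        "index_only": len(index) - x_hit,  # (iii), same with the files swapped
    }


def neither_estimate(query, index, genome_size, union_count):
    """(iv) as read from the paper: genome_size / mean_interval_size - union.

    Assumption: the mean is pooled over all intervals of both files, and
    union_count is supplied by the caller because its definition is the open question.
    """
    all_intervals = query + index
    mean_size = sum(end - start for _, start, end in all_intervals) / len(all_intervals)
    return genome_size / mean_size - union_count
```

For a query of [("chr1", 100, 500)] and an index of [("chr1", 150, 200), ("chr1", 300, 350), ("chr1", 900, 1000)], this gives query_with_hit = 1, index_with_hit = 2, and overlapping_pairs = 2, which is exactly the "1, 2, or 3" question for (i).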
In short, I'd like to be able to recreate the contingency table given two arbitrary BED files. Thanks!