As requested, I've put together an example of using the different
intersection algorithms in bx, with some rough timings. The algorithms
are:
-- Intersecter: this is the oldest and uses sorted endpoints. The
search phase appears to be very slow, could be sped up a lot by
improving that, but probably worth it (having run this, I'm probably
going to just this and make it a wrapper to quicksect)
-- quicksect: interval tree with random balance. This is what the join
tool in Galaxy uses
-- quicksect.pyx: a 'port' of quicksect to Cython from Brent Pederson.
This is really nice, but it doesn't have exactly the same interface as
the other quicksect so it hadn't been integrated yet. One important
thing about this is that it can also do "neighborhood" queries -- find
the 10 closest features upstream of ...
-- interval_index_file.Index: tree of bins + sorted lists in bins. The
advantage of this is that it has a nice on-disk representation. One
can build an index of a huge number of intervals, save it, and then
queries require only loading a subset of the bins. Probably the
closest thing to NCList. Pure Python, so room for optimization. It can
be used just in memory or on disk, I've timed both. Note: normally
this should be used through the wrapper "Indexes" which writes nice
headers with version, value size, et cetera.
Code:
http://bx.mathcs.emory.edu/hg/overlap-query-testing/
The output:
Big to small
python overlap_examples.py data/
hg18_chr1_phastConsElements44wayPlacental.bed.bz2 data/
hg18_chr1_knownGene.bed.bz2
Query intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Target intervals: data/hg18_chr1_knownGene.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 0.026 seconds
Using: 14.943 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 0.415 seconds
Using: 9.561 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 0.019 seconds
Using: 0.268 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 0.026 seconds
Using: 5.947 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 0.026 seconds
Writing: 0.069 seconds
Using: 3.290 seconds
Small to big
python overlap_examples.py data/hg18_chr1_knownGene.bed.bz2 data/
hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Query intervals: data/hg18_chr1_knownGene.bed.bz2
Target intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 2.108 seconds
Using: 14.349 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 31.120 seconds
Using: 1.331 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 2.208 seconds
Using: 0.049 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 2.371 seconds
Using: 0.987 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 2.694 seconds
Writing: 2.487 seconds
Using: 0.111 seconds
On Feb 11, 2009, at 3:15 PM, Istvan Albert wrote: