more benchmarks

58 views
Skip to first unread message

Istvan Albert

unread,
Feb 11, 2009, 2:38:57 PM2/11/09
to pygr-dev

I have been long interested in comparing some of the performance
numbers across bioinformatics frameworks. Yesterday Jenny brought this
up, that gave me the push to see some numbers for myself.

Below find some comparison numbers for fasta parsing, iteration,
sequence slicing, reverse complement for 4 bioinformatics frameworks:

Pygr: http://code.google.com/p/pygr/
Biopython: http://biopython.org/wiki/Main_Page
BX Python: http://bx-python.trac.bx.psu.edu/
PyCogent: http://pycogent.sourceforge.net/

Here are the results: http://code.google.com/p/pygr/source/browse/contrib/compare/results.txt

The code: http://code.google.com/p/pygr/source/browse/contrib/compare/benchmark.py

Guess what the numbers were ... here is summary:

=============================

*** benchmarking=data/100K.fasta

Test Time
pygr_parse_fasta 5.7s
bio_parse_fasta 24.8s
bx_parse_fasta 28.8s
cogent_parse_fasta 56.1s
--------------------
pygr_iter 14.5s
bio_iter 6.7s
bx_iter 5.8s
cogent_iter 8.2s
--------------------
pygr_slice 11.8s
bio_slice 1.4s
bx_slice 0.8s
cogent_slice 11.9s
--------------------
pygr_reverse_comp 35.6s
bio_reverse_comp 12.4s
bx_reverse_comp 8.1s
cogent_reverse_comp 44.4s

========================================

*** benchmarking=data/dm.fasta

Test Time
pygr_parse_fasta 36.6s
bio_parse_fasta 17.7s
bx_parse_fasta 29.1s
cogent_parse_fasta 40.2s
--------------------
pygr_iter 0.0s
bio_iter 3.0s
bx_iter 2.9s
cogent_iter 2.9s
--------------------
pygr_slice 11.5s
bio_slice 1.4s
bx_slice 0.8s
cogent_slice 12.1s
--------------------
pygr_reverse_comp 137.8s
bio_reverse_comp 5.3s
bx_reverse_comp 31.7s
cogent_reverse_comp 26.6s

James Taylor

unread,
Feb 11, 2009, 2:44:07 PM2/11/09
to pygr...@googlegroups.com
Interesting. FWIW, the fasta parser in bx has had little effort put
into optimization (Bob Harris did some). The nib and 2bit parsers are
hopefully where it shines.

It'd be really cool to compare some of the other features like
IntervalIndexFile vs. NLMSA. I've not had time to do it, but it might
be interesting (and encourage one or all of us to do some
optimization ;).

Istvan Albert

unread,
Feb 11, 2009, 3:15:55 PM2/11/09
to pygr-dev


On Feb 11, 2:44 pm, James Taylor <ja...@jamestaylor.org> wrote:

> It'd be really cool to compare some of the other features like  
> IntervalIndexFile vs. NLMSA. I've not had time to do it, but it might  
> be interesting (and encourage one or all of us to do some  
> optimization ;).

lately I've become interested in exploring interval overlap query
algorithms. It looks like just about everything in the genome is
transcribed at some point, thus we will end up with orders of
magnitude more interval data than what we were used to have. Having a
seamless way to query these is just as important as being able to
fetch a sequence and operate on it.

If you guys write some docs on how this IntervalIndexFile is used in
practice, with some some examples etc. I'll write benchmarks to
compare it to other type of queries.

Istvan

James Taylor

unread,
Feb 11, 2009, 3:24:38 PM2/11/09
to pygr...@googlegroups.com
I'll try to find some time to write some more docs soon, but fyi there
are actually several overlap query algorithms in bx of different types.

IntervalIndexFile is mostly used for indexing large alignments on disk
by position. It a tree of bins of fixed size (unlike the dynamic
binning the NCLs do). The actual tools to use it are mostly MAF
oriented, but I can write a simple script for indexing something more
arbitrary. (The thing it actually stores in the on disk representation
is an offset to the data of interest in another file). It is pretty
disk oriented, and works well on network file systems since it only
needs to read the subset of bins needed to find intervals in a
specific region.

There is also intervals.Intersecter which is of the offline sorted
endpoint / binary tree sort. This is in memory only.

And there is quicksect.py, which is a different type of tree, and is
what the join tool in Galaxy uses. This is also in memory only but
uses an extension module -- I think it can be a lot faster than
Intersecter.

Maybe this weekend I can put together some quick examples.

-- jt

James Taylor

unread,
Feb 18, 2009, 2:25:54 PM2/18/09
to pygr...@googlegroups.com, Brent Pedersen
As requested, I've put together an example of using the different
intersection algorithms in bx, with some rough timings. The algorithms
are:

-- Intersecter: this is the oldest and uses sorted endpoints. The
search phase appears to be very slow, could be sped up a lot by
improving that, but probably worth it (having run this, I'm probably
going to just this and make it a wrapper to quicksect)

-- quicksect: interval tree with random balance. This is what the join
tool in Galaxy uses

-- quicksect.pyx: a 'port' of quicksect to Cython from Brent Pederson.
This is really nice, but it doesn't have exactly the same interface as
the other quicksect so it hadn't been integrated yet. One important
thing about this is that it can also do "neighborhood" queries -- find
the 10 closest features upstream of ...

-- interval_index_file.Index: tree of bins + sorted lists in bins. The
advantage of this is that it has a nice on-disk representation. One
can build an index of a huge number of intervals, save it, and then
queries require only loading a subset of the bins. Probably the
closest thing to NCList. Pure Python, so room for optimization. It can
be used just in memory or on disk, I've timed both. Note: normally
this should be used through the wrapper "Indexes" which writes nice
headers with version, value size, et cetera.

Code: http://bx.mathcs.emory.edu/hg/overlap-query-testing/

The output:

Big to small

python overlap_examples.py data/
hg18_chr1_phastConsElements44wayPlacental.bed.bz2 data/
hg18_chr1_knownGene.bed.bz2
Query intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Target intervals: data/hg18_chr1_knownGene.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 0.026 seconds
Using: 14.943 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 0.415 seconds
Using: 9.561 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 0.019 seconds
Using: 0.268 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 0.026 seconds
Using: 5.947 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 0.026 seconds
Writing: 0.069 seconds
Using: 3.290 seconds

Small to big

python overlap_examples.py data/hg18_chr1_knownGene.bed.bz2 data/
hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Query intervals: data/hg18_chr1_knownGene.bed.bz2
Target intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 2.108 seconds
Using: 14.349 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 31.120 seconds
Using: 1.331 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 2.208 seconds
Using: 0.049 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 2.371 seconds
Using: 0.987 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 2.694 seconds
Writing: 2.487 seconds
Using: 0.111 seconds


On Feb 11, 2009, at 3:15 PM, Istvan Albert wrote:

Christopher Lee

unread,
Feb 18, 2009, 3:27:37 PM2/18/09
to pygr...@googlegroups.com
Fantastic. This will motivate me to add C extension code for such
joins to Pygr and to test performance on the same data that you used.

Thanks, James!

-- Chris

Istvan Albert

unread,
Feb 18, 2009, 8:12:33 PM2/18/09
to pygr-dev


On Feb 18, 2:25 pm, James Taylor <ja...@jamestaylor.org> wrote:
> As requested, I've put together an example of using the different  
> intersection algorithms in bx, with some rough timings.

Nice. I'll see how it compares to other approaches. Also maybe you've
just created the quintessetial dataset and timing that will be used to
compare all interval querying from now on....

On a related note I've read an interesting blog post today on using
python for interval query.

http://www.logarithmic.net/pfh/blog/01234937824

It looks like a variant of binning, but one that combines the bin
number with the size of the interval. The author calls this range
codes. Has anyone seen this type of approach before?

cheers,

Istvan

brentp

unread,
Feb 18, 2009, 9:09:47 PM2/18/09
to pygr-dev
my first try seems to be lost, trying again.
hi,
this approach is similar to geohashing:
http://en.wikipedia.org/wiki/Geohash
of which there's a python implementation here:
http://mappinghacks.com/code/geohash.py.txt

a while back i wrote (and abandoned) a 'biohash' version to store
intervals/ranges as
a string of 0/1's suitable for BTree indexing on which one could then
do range queries.
gisted here:
http://gist.github.com/66671
and cython version:
http://gist.github.com/66673

-brentp

brentp

unread,
Feb 18, 2009, 8:59:20 PM2/18/09
to pygr-dev
On Feb 18, 5:12 pm, Istvan Albert <istvan.alb...@gmail.com> wrote:
hi,
this approach is similar to geohashing: http://en.wikipedia.org/wiki/Geohash
of which there's a python implementation here:
http://mappinghacks.com/code/geohash.py.txt
the idea is to allow you to do spatial queries fairly efficiently with
a BTree.

i wrote (and abandoned) a biohash version a while back.
gisted here:
http://gist.github.com/66671
and i just found my cython version:
http://gist.github.com/66673


-brent

Istvan Albert

unread,
Feb 19, 2009, 8:42:55 AM2/19/09
to pygr-dev


On Feb 18, 8:59 pm, brentp <bpede...@gmail.com> wrote:

> this approach is similar to geohashing:http://en.wikipedia.org/wiki/Geohash
> of which there's a python implementation here:http://mappinghacks.com/code/geohash.py.txt
> the idea is to allow you to do spatial queries fairly efficiently with
> a BTree.

Interesting. Thanks for sharing.

Istvan Albert

unread,
Feb 25, 2009, 1:05:10 PM2/25/09
to pygr-dev


On Feb 18, 2:25 pm, James Taylor <ja...@jamestaylor.org> wrote:

> ---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
> Building: 0.019 seconds
> Using: 0.268 seconds

I just sat down to try this out myself ... and whoa ... this is eye-
poppingly fast, I missed it on first read as it was nestled in between
the other numbers.

this is a querying 350,000 intervals against 6K genes .... and it even
contains the looping and function call costs too, those are probably
0.1 sec of it.

Very impressive. I got to get in on the action on this.

Istvan



Paul

unread,
Mar 3, 2009, 1:06:10 AM3/3/09
to pygr-dev

Hi. Interesting discussion!

On Feb 19, 12:12 pm, Istvan Albert <istvan.alb...@gmail.com> wrote:
> On Feb 18, 2:25 pm, James Taylor <ja...@jamestaylor.org> wrote:
>
> > As requested, I've put together an example of using the different  
> > intersection algorithms in bx, with some rough timings.
>
> Nice. I'll see how it compares to other approaches. Also maybe you've
> just created the quintessetial dataset and timing that will be used to
> compare all interval querying from now on....
>
> On a related note I've read an interesting blog post today on using
> python for interval query.
>
> http://www.logarithmic.net/pfh/blog/01234937824
>

I've since come up with what I consider a better approach. Fast enough
for my needs and easy to implement in an SQL database.

http://www.logarithmic.net/pfh/blog/01235197474

At Istvan's prompting, I've tested this with James' test data set. I
assumed pythonic intervals, ie lower<=x<upper. For
the small-to-big test it doesn't do too badly. There's probably some
room for optimization by converting from Python to C code.

http://www.logarithmic.net/pfh-files/blog/01235197474/pfh_example.py

Timing results on my server:

Query intervals: hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Target intervals: hg18_chr1_knownGene.bed.bz2
---> Using pfh's method, pure python version
Building: 0.020 seconds
Using: 26.290 seconds
---> Using pfh's method, SQLite version
Building: 0.130 seconds
Using: 38.550 seconds


Query intervals: hg18_chr1_knownGene.bed.bz2
Target intervals: hg18_chr1_phastConsElements44wayPlacental.bed.bz2
---> Using pfh's method, pure python version
Building: 2.060 seconds
Using: 0.850 seconds
---> Using pfh's method, SQLite version
Building: 6.890 seconds
Using: 2.840 seconds


cheers,
Paul Harrison / VBC / Monash University

Istvan Albert

unread,
Mar 3, 2009, 12:03:21 PM3/3/09
to pygr-dev

Thanks for the comparisons Paul. It is impressively fast for a
reasonably simple pure python based solution. Based on nice ideas too.
We'll soon end up with the paradox of choice.

I'll see if can summarize all these on a website for the community at
large. Those blessed with good blogging skills may want to do it as
well (hint hint)

Istvan



Reply all
Reply to author
Forward
0 new messages