BEDOPS benchmarks


Gert Hulselmans

Dec 10, 2013, 7:02:01 PM
to bedops-...@googlegroups.com
Hi,


> 1.2.2. BEDOPS tools are fast and efficient
> BEDOPS tools take advantage of the information in a sorted BED file to use only what data are
> needed to perform the analysis. Our tools are agnostic about genomes: Run BEDOPS tools on
> genomes as small as Circovirus or as large as Polychaos dubium!


As of BEDTools v2.17, intersectBed has a -sorted option (which should improve performance):

    -sorted Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input
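A minimal sketch of how the sorted sweep might be invoked (file names here are placeholders):

```shell
# Sort both inputs lexicographically by chromosome, numerically by start,
# then let intersectBed use the chromsweep algorithm
sort -k1,1 -k2,2n a.bed > a.sorted.bed
sort -k1,1 -k2,2n b.bed > b.sorted.bed
intersectBed -sorted -a a.sorted.bed -b b.sorted.bed > overlaps.bed
```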



> BEDOPS also introduces a novel and lossless compression format called Starch that
> reduces whole-genome BED datasets to ~5% of their original size (and BAM datasets
> to roughly 35% of their original size), while adding useful metadata and random access,
> allowing instantaneous retrieval of any compressed chromosome:

Since the Starch format is compared here to bzip2-compressed bedGraph and bzip2-compressed Wig files,
it would be a fairer comparison to also include the bigWig format:



The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph. BigWig files are created initially from wiggle (wig) type files, using the program wigToBigWig. The resulting bigWig files are in an indexed binary format. The main advantage of the bigWig files is that only the portions of the files needed to display a particular region are transferred to UCSC, so for large data sets bigWig is considerably faster than regular wiggle files. The bigWig file remains on your web accessible server (http, https, or ftp), not on the UCSC server. Only the portion that is needed for the chromosomal position you are currently viewing is locally cached as a "sparse file".
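For reference, a bigWig file is typically built from a wiggle track roughly as follows (the assembly and file names here are placeholders):

```shell
# Fetch chromosome sizes for the assembly, then convert wig -> bigWig
fetchChromSizes hg19 > hg19.chrom.sizes
wigToBigWig signal.wig hg19.chrom.sizes signal.bw
```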



Another (more general BED like) format is bigBed:

The bigBed format stores annotation items that can either be simple, or a linked collection of exons, much as BED files do. BigBed files are created initially from BED type files, using the program bedToBigBed. The resulting bigBed files are in an indexed binary format. The main advantage of the bigBed files is that only the portions of the files needed to display a particular region are transferred to UCSC, so for large data sets bigBed is considerably faster than regular BED files. The bigBed file remains on your web accessible server (http, https, or ftp), not on the UCSC server. Only the portion that is needed for the chromosomal position you are currently viewing is locally cached as a "sparse file".
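Similarly, a bigBed file can be produced from a sorted BED file, roughly like this (file names are placeholders):

```shell
# bedToBigBed requires the input BED to be sorted by chromosome and start
sort -k1,1 -k2,2n annotations.bed > annotations.sorted.bed
bedToBigBed annotations.sorted.bed hg19.chrom.sizes annotations.bb
```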




An even more general approach (working on any TAB-separated file) is bgzip (for compression) combined with tabix (for indexing):
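A minimal sketch of the bgzip/tabix approach (file names and coordinates are placeholders):

```shell
# Compress a position-sorted BED file with bgzip, then index it with tabix
sort -k1,1 -k2,2n regions.bed | bgzip > regions.bed.gz
tabix -p bed regions.bed.gz
# The index then allows retrieval of just a sub-chromosomal slice
tabix regions.bed.gz chr1:100000-200000
```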


As far as I know, bigWig, bigBed and bgzipped files (with a tabix index) have a finer-grained index (not just per chromosome),
so when extracting a specific region within a chromosome rather than a whole chromosome, they will probably perform better
than Starch (at least when a single chromosome contains many records).


Best,
Gert

Alex Reynolds

Dec 10, 2013, 7:23:35 PM
to bedops-...@googlegroups.com
Hi Gert,

Our numbers should be accurate for where BEDOPS and Bedtools stood when we went to publication, but we'll look into updating the performance testing as we approach the next major release. Still, for most comparable operations, BEDOPS tools remain faster and use less memory, as shown in the independent test results published earlier this year.

Regards,
Alex

Gert Hulselmans

Dec 23, 2013, 8:19:50 PM
to bedops-...@googlegroups.com
Hi Alex,

I just discovered that bedtools has a new home website:

Their v2.18 release seems to include some major speed improvements:

Performance with sorted datasets. The "chromsweep" algorithm we use for detecting intersections is now 60 times faster than when it was first released in version 2.16.2, and is 4 times faster than the 2.17 release. This makes the algorithm slightly faster than the algorithm used in the bedops bedmap tool. As an example, the following figure demonstrates the speed when intersecting GENCODE exons against 1, 10, and 100 million BAM alignments from an exome capture experiment. Whereas in version 2.16.2 this would have taken 80 minutes, it now takes 80 seconds.



It would be great if bedops (and/or bedtools) would support the various binary UCSC formats (bigBed and bigWig).
As those formats already contain sorted content, I guess not much of the bedops code would need to change.

Regards,
Gert

On Wednesday, December 11, 2013, 01:23:35 UTC+1, Alex Reynolds wrote:

Alex Reynolds

Dec 23, 2013, 9:10:02 PM
to Gert Hulselmans, Alex Reynolds, bedops-...@googlegroups.com
While a few other binary formats might provide finer-grained access to genomic regions, there is a lot more storage overhead with the indices that these formats need to carry alongside the actual genomic data.

Starch provides fast access to per-chromosome regions of data with very little index overhead, relatively speaking. Literally on the order of a few kilobytes in Starch, compared with tens to hundreds of megabytes with alternative formats, depending on how much data is being compressed. While insignificant for a few files, for hundreds and thousands of archives, this extra space savings can be very useful.

More than just efficient archival of BED data, Starch also precomputes useful statistics for sequencing analysis, providing instantaneous access to element counts, base counts, and unique base counts on a per-chromosome and whole archive basis.
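Assuming the standard unstarch statistics flags, those precomputed values can be queried roughly like this (the archive name is a placeholder):

```shell
# Whole-archive and per-chromosome element counts from Starch metadata
unstarch --elements archive.starch
unstarch chr1 --elements archive.starch
# Base counts and unique base counts, likewise without decompressing the data
unstarch chr1 --bases archive.starch
unstarch chr1 --bases-uniq archive.starch
```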

Starch also provides a way to tag archives with open-ended metadata that can be used for marking them up with useful, descriptive information about the experiment, data provenance, lab equipment, etc. I don't believe other formats provide these options.

As always, you should choose the data formats that work best for your workflow.

Regards,
Alex

Alex Reynolds

Dec 23, 2013, 9:25:21 PM
to Gert Hulselmans, Alex Reynolds, bedops-...@googlegroups.com

On Dec 23, 2013, at 5:19 PM, Gert Hulselmans <hulselm...@gmail.com> wrote:

> It would be great if bedops (and/or bedtools) would support the various binary UCSC formats (bigBed and bigWig).
> As those formats already contain sorted content, I guess not much code of bedops would need to change.

A core design consideration for BEDOPS is the use of UNIX pipes to handle data flow, particularly standard input and output. Piping makes analysis pipeline design easy and improves performance, by reducing the amount of (slow) disk I/O.

There are already tools that process bigBed and bigWig data into BED, and so it should be trivial to pipe the output of those tools into BEDOPS tools.

Specifically, bedops, bedmap and other BEDOPS tools use the hyphen character to replace a filename with standard input.

So given commands called "foo" and "bar" which print unsorted BED data to standard output, you can easily do the following:

$ foo someData.foo | sort-bed - | bedops --range 1000 --everything - > somePaddedData.bed

Or:

$ bar someData.bar | sort-bed - | bedmap --echo --echo-map-id - anotherDataset.bed > someDataWithOverlappingIDs.bed

For a real-world application, suppose you want to count, for each GFF record, the number of overlapping BAM-formatted reads. We can use BEDOPS conversion scripts to handle this analysis:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < records.gff | bedmap --echo --count - reads.bed > answer.bed

Using bigBed/bigWig or other format conversion tools that write to standard output, you can integrate their data in similar ways with BEDOPS tools.
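For instance, assuming the UCSC bigBedToBed utility and writing through /dev/stdout, a bigBed annotation set could be fed into bedmap like so (file names are placeholders):

```shell
# Convert bigBed to BED on the fly and count overlapping reads per record
bigBedToBed annotations.bb /dev/stdout | sort-bed - | bedmap --echo --count - reads.bed > counts.bed
```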

Regards,
Alex