Group: http://groups.google.com/group/bedops-discuss/topics
- BEDOPS benchmarks [3 Updates]
Gert Hulselmans <hulselm...@gmail.com> Dec 23 05:19PM -0800
Hi Alex,
I just discovered that bedtools has a new home website:
https://github.com/arq5x/bedtools2
Their v2.18 release seems to have some major speed improvements:
*Performance with sorted datasets.* The “chromsweep” algorithm we use for
detecting intersections is now *60 times faster* than when it was first
released in version 2.16.2, and is *4 times faster* than the 2.17 release.
This makes the algorithm slightly faster than the algorithm used in the
bedops bedmap tool. As an example, the following figure <https://dl.dropboxusercontent.com/u/515640/bedtools-intersect-sorteddata.png> demonstrates the speed when intersecting GENCODE exons against 1, 10, and
100 million BAM alignments from an exome capture experiment. Whereas in
version 2.16.2 this would have taken 80 minutes, *it now takes 80 seconds*.
http://quinlanlab.org/software-releases/bedtools-2.18.html
It would be great if bedops (and/or bedtools) would support the various
binary UCSC formats (bigBed and bigWig).
As those formats already contain sorted content, I guess not much code of
bedops would need to change.
Regards,
Gert
On Wednesday, 11 December 2013 at 01:23:35 UTC+1, Alex Reynolds wrote:
Alex Reynolds <alexpr...@gmail.com> Dec 23 06:10PM -0800
While a few other binary formats might provide finer-grained access to genomic regions, there is a lot more storage overhead with the indices that these formats need to carry alongside the actual genomic data.
Starch provides fast access to per-chromosome regions of data with very little index overhead, relatively speaking: on the order of a few kilobytes in Starch, compared with tens to hundreds of megabytes with alternative formats, depending on how much data is being compressed. While the difference is insignificant for a few files, across hundreds or thousands of archives the space savings can be very useful.
Beyond efficient archival of BED data, Starch also precomputes useful statistics for sequencing analysis, providing instantaneous access to element counts, base counts, and unique base counts on a per-chromosome and whole-archive basis.
Starch also provides a way to tag archives with open-ended metadata that can be used for marking them up with useful, descriptive information about the experiment, data provenance, lab equipment, etc. I don't believe other formats provide these options.
As always, you should choose the data formats that work best for your workflow.
Regards,
Alex
Alex Reynolds <alexpr...@gmail.com> Dec 23 06:25PM -0800
> It would be great if bedops (and/or bedtools) would support the various binary UCSC formats (bigBed and bigWig).
> As those formats already contain sorted content, I guess not much code of bedops would need to change.
A core design consideration for BEDOPS is the use of UNIX pipes to handle data flow, particularly standard input and output. Piping makes analysis pipeline design easy and improves performance, by reducing the amount of (slow) disk I/O.
There are already tools that process bigBed and bigWig data into BED, and so it should be trivial to pipe the output of those tools into BEDOPS tools.
Specifically, bedops, bedmap and other BEDOPS tools use the hyphen character to replace a filename with standard input.
So given commands called "foo" and "bar" which print unsorted BED data to standard output, you can easily do the following:
$ foo someData.foo | sort-bed - | bedops --range 1000 --everything - > somePaddedData.bed
Or:
$ bar someData.bar | sort-bed - | bedmap --echo --echo-map-id - anotherDataset.bed > someDataWithOverlappingIDs.bed
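To make the data flow in these pipelines concrete, here is a minimal sketch that runs without BEDOPS installed. Everything in it is a stand-in and an assumption of this example: "foo" is a hypothetical shell function that prints unsorted BED, sort_bed_standin approximates the ordering sort-bed produces (chromosome, then start, then end), and pad_standin approximates the padding step of "bedops --range 1000 --everything -" with awk, clamping start coordinates at 0. The stand-ins read standard input implicitly; the real BEDOPS tools need the explicit hyphen shown above.

```shell
# Hypothetical stand-in for "foo": prints unsorted BED to standard output.
foo() {
  printf 'chr2\t5000\t5100\n'
  printf 'chr1\t2000\t2100\n'
  printf 'chr1\t500\t600\n'
}

# Stand-in for sort-bed: order by chromosome, then start, then end.
sort_bed_standin() {
  LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Stand-in for the padding step: widen each element by 1000 bases on
# both sides, clamping the start coordinate at 0 (an assumption here).
pad_standin() {
  awk -v pad=1000 'BEGIN { FS = OFS = "\t" }
    { s = $2 - pad; if (s < 0) s = 0; $2 = s; $3 = $3 + pad; print }'
}

# No intermediate files: each stage streams its output to the next.
foo | sort_bed_standin | pad_standin > somePaddedData.bed
```

Swapping the stand-ins for the real converter, sort-bed and bedops gives the first pipeline above; only the final result touches the disk.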
For a real-world application, suppose you want to count the number of BAM-formatted reads that overlap each GFF record. We can use BEDOPS conversion scripts to handle this analysis:
$ bam2bed < reads.bam > reads.bed
$ gff2bed < records.gff | bedmap --echo --count - reads.bed > answer.bed
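To show what the counting step computes, here is a rough sketch on toy data using plain awk. It is not how bedmap works internally: it is a naive stand-in that compares every record against every read, and the file contents are made up for illustration. It assumes an overlap of at least one base counts, and it prints each record followed by "|" and its count, mirroring the layout of "bedmap --echo --count".

```shell
# Toy inputs (invented for this sketch): two records and three reads.
printf 'chr1\t100\t200\tgeneA\nchr1\t500\t600\tgeneB\n' > records.bed
printf 'chr1\t150\t180\nchr1\t190\t250\nchr1\t300\t400\n' > reads.bed

# First pass loads the reads; second pass counts, for each record,
# the reads on the same chromosome that overlap by at least one base.
awk -F'\t' '
  NR == FNR { chr[NR] = $1; start[NR] = $2 + 0; end[NR] = $3 + 0; n = NR; next }
  {
    count = 0
    for (i = 1; i <= n; i++)
      if (chr[i] == $1 && start[i] < $3 + 0 && end[i] > $2 + 0)
        count++
    print $0 "|" count
  }' reads.bed records.bed > answer.bed
```

With sorted inputs, bedmap does the same job in a streaming fashion, which is why the conversion scripts and sort-bed sit upstream of it in the commands above.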
Using bigBed/bigWig or other format conversion tools that write to standard output, you can integrate their data in similar ways with BEDOPS tools.
Regards,
Alex