Re: Digest for bedops-discuss@googlegroups.com - 3 Messages in 1 Topic

Shane Neph

Dec 24, 2013, 11:00:06 PM

to bedops-...@googlegroups.com

I don't seem to get the updates for this group except these digests for some reason. Any suggestions?

I took a look at that PNG output. Usually bedops -e would be compared with the intersectBed tool found in bedtools. It's worth noting that bedops is much faster than bedmap under usual circumstances. They slapped on a -c to make intersectBed behave more like bedmap's --count feature. We try to separate true 'set-like' operations from mapping operations. Also, if these files contain no fully-nested elements, then bedmap can be run with the --faster option, making it run as fast as the bedops program would, and much faster than the time shown on that graph. bedmap could run much faster if we did not support all of the overlap options that we do. It's still quite fast, and it comes with loads of useful options that, between them, do what many programs in the bedtools suite do together.
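
To make the --faster point concrete, here is a minimal sketch, assuming BEDOPS is installed and on your PATH. The file names and toy coordinates are made up for illustration; both files are in sort-bed order and contain no fully-nested elements, which is the precondition for --faster.

```shell
# Toy inputs (hypothetical data), already in sort-bed order, no nested elements.
printf 'chr1\t100\t200\texonA\nchr1\t500\t600\texonB\n' > faster_ref.bed
printf 'chr1\t150\t180\nchr1\t190\t250\nchr1\t550\t560\n' > faster_map.bed

if command -v bedmap >/dev/null 2>&1; then
  # Print each reference element with the number of map elements overlapping it.
  # --faster is only safe under the restrictions described above.
  bedmap --faster --echo --count faster_ref.bed faster_map.bed
else
  echo "bedmap not found; install BEDOPS to run this step" >&2
fi
```

Dropping --faster should give the same counts on inputs like these; the flag only matters for run time on large files.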

Regarding bigWig/bigBed and the dozen or so other formats commonly used in genomics - I have but one simple question...if we can convert these files into the version of "BED" that BEDOPS supports without loss of information, then why do these dozen or so formats exist? Time for standardization, in my opinion. starch has proven to compress BED better than any other format, and it provides immediate access to the most natural partition of most genomic data sets - by chromosome. For the rare cases where fast lookups of specific regions are actually meaningful, like the UCSC genome browser, we currently offer the bedextract utility. There are constraints on bedextract (the same ones that bedmap's --faster option has) which prevent this in the most general setup. This could be fixed pretty easily if people were willing to standardize on a BED-like format, and it would require no extra space, nor an extra index file like BAM requires.
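
For anyone who has not used bedextract, here is a small sketch of the per-chromosome and region lookups mentioned above, assuming BEDOPS is installed. The file names and coordinates are invented; the target has to be a regular sorted BED file (not standard input), and for the region query it must contain no fully-nested elements.

```shell
# Hypothetical sorted target and a one-line query region.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t80\nchr2\t500\t900\n' > extract_target.bed
printf 'chr2\t60\t70\n' > extract_query.bed

if command -v bedextract >/dev/null 2>&1; then
  bedextract --list-chr extract_target.bed           # chromosomes present in the file
  bedextract chr2 extract_target.bed                 # every element on chr2
  bedextract extract_query.bed extract_target.bed    # target elements overlapping the query
else
  echo "bedextract not found; install BEDOPS to run this step" >&2
fi
```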

Perhaps we will one day support this in our variant of BED, but, honestly, outside of rare cases and a genome browser's use case, the feature of 'quick lookup' is a far overhyped one. It doesn't take very many (relatively speaking) 'quick lookups' before you have greater efficiency sweeping through an entire file, or at least a chromosome's worth of data.
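
One way to picture the sweep argument: rather than issuing one lookup per region of interest, put all of the regions into a single sorted file and let one pass over the data answer every query at once. A hedged sketch with bedmap, assuming BEDOPS is installed; the file names and toy data are hypothetical.

```shell
# Three hypothetical regions of interest and four data elements, both sorted.
printf 'chr1\t100\t200\tr1\nchr1\t1000\t1200\tr2\nchr2\t0\t50\tr3\n' > sweep_regions.bed
printf 'chr1\t150\t160\ta\nchr1\t1100\t1300\tb\nchr2\t10\t20\tc\nchr2\t400\t410\td\n' > sweep_data.bed

if command -v bedmap >/dev/null 2>&1; then
  # One sweep: each region is printed alongside the data elements overlapping it.
  bedmap --echo --echo-map sweep_regions.bed sweep_data.bed
else
  echo "bedmap not found; install BEDOPS to run this step" >&2
fi
```

The command stays the same whether the regions file holds three lines or three million, which is where the sweep starts to win over repeated lookups.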

Don't mean to sound bullheaded, and we appreciate the heads up and suggestions. I personally find it annoying that so many of these pseudo-standard formats have crept into wide use, and yet they offer little to no information that cannot be represented in a simple BED-like format.

Shane

On Tue, Dec 24, 2013 at 3:14 PM, <bedops-...@googlegroups.com> wrote:

Today's Topic Summary

Group: http://groups.google.com/group/bedops-discuss/topics

BEDOPS benchmarks [3 Updates]

BEDOPS benchmarks

Gert Hulselmans <hulselm...@gmail.com> Dec 23 05:19PM -0800

Hi Alex,

I just discovered that bedtools has a new home website:
https://github.com/arq5x/bedtools2

Their v2.18 release seems to have some major speed improvements:

*Performance with sorted datasets.* The “chromsweep” algorithm we use for
detecting intersections is now *60 times faster* than when it was first
released in version 2.16.2, and is *4 times faster* than the 2.17 release.
This makes the algorithm slightly faster than the algorithm used in the
bedops bedmap tool. As an example, the following figure
<https://dl.dropboxusercontent.com/u/515640/bedtools-intersect-sorteddata.png>
demonstrates the speed when intersecting GENCODE exons against 1, 10, and
100 million BAM alignments from an exome capture experiment. Whereas in
version 2.16.2 this would have taken 80 minutes, *it now takes 80 seconds*.

http://quinlanlab.org/software-releases/bedtools-2.18.html

It would be great if bedops (and/or bedtools) would support the various
binary UCSC formats (bigBed and bigWig).
As those formats already contain sorted content, I guess not much code of
bedops would need to change.

Regards,
Gert

On Wednesday, December 11, 2013 at 01:23:35 UTC+1, Alex Reynolds wrote:

Alex Reynolds <alexpr...@gmail.com> Dec 23 06:10PM -0800

While a few other binary formats might provide finer-grained access to genomic regions, there is a lot more storage overhead with the indices that these formats need to carry alongside the actual genomic data.

Starch provides fast access to per-chromosome regions of data with very little index overhead, relatively speaking: literally on the order of a few kilobytes in Starch, compared with tens to hundreds of megabytes with alternative formats, depending on how much data is being compressed. While the difference is insignificant for a few files, for hundreds or thousands of archives these space savings can be very useful.

More than just efficient archival of BED data, Starch also precomputes useful statistics for sequencing analysis, providing instantaneous access to element counts, base counts, and unique base counts on a per-chromosome and whole archive basis.
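
As a rough illustration of those precomputed statistics, assuming the BEDOPS starch and unstarch tools are installed (the file names and coordinates are placeholders):

```shell
# Hypothetical sorted BED input.
printf 'chr1\t100\t200\nchr1\t150\t300\nchr2\t50\t80\n' > stats_example.bed

if command -v starch >/dev/null 2>&1 && command -v unstarch >/dev/null 2>&1; then
  starch stats_example.bed > stats_example.starch    # compress the sorted BED file
  unstarch --elements stats_example.starch           # element count, whole archive
  unstarch --bases stats_example.starch              # base count, whole archive
  unstarch --bases-uniq stats_example.starch         # unique base count, whole archive
  unstarch chr1 --elements stats_example.starch      # element count for chr1 only
  unstarch chr2 stats_example.starch                 # extract just chr2 as BED
else
  echo "starch/unstarch not found; install BEDOPS to run this step" >&2
fi
```

The counts are read from the archive's metadata rather than recomputed, so these calls should return quickly even for large archives.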

Starch also provides a way to tag archives with open-ended metadata that can be used for marking them up with useful, descriptive information about the experiment, data provenance, lab equipment, etc. I don't believe other formats provide these options.
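
A short sketch of tagging an archive, again assuming starch and unstarch are installed. The note text and file names are only examples.

```shell
# Hypothetical sorted BED input.
printf 'chr1\t100\t200\nchr2\t50\t80\n' > note_example.bed

if command -v starch >/dev/null 2>&1 && command -v unstarch >/dev/null 2>&1; then
  # Attach free-form metadata when the archive is created...
  starch --note="hypothetical experiment, lab and date go here" note_example.bed > note_example.starch
  # ...and read it back later without extracting any data.
  unstarch --note note_example.starch
else
  echo "starch/unstarch not found; install BEDOPS to run this step" >&2
fi
```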

As always, you should choose the data formats that work best for your workflow.

Regards,
Alex

Alex Reynolds <alexpr...@gmail.com> Dec 23 06:25PM -0800

> It would be great if bedops (and/or bedtools) would support the various binary UCSC formats (bigBed and bigWig).
> As those formats already contain sorted content, I guess not much code of bedops would need to change.

A core design consideration for BEDOPS is the use of UNIX pipes to handle data flow, particularly standard input and output. Piping makes analysis pipeline design easy and improves performance, by reducing the amount of (slow) disk I/O.

There are already tools that process bigBed and bigWig data into BED, and so it should be trivial to pipe the output of those tools into BEDOPS tools.

Specifically, bedops, bedmap and other BEDOPS tools use the hyphen character to replace a filename with standard input.

So given commands called "foo" and "bar" which print unsorted BED data to standard output, you can easily do the following:

$ foo someData.foo | sort-bed - | bedops --range 1000 --everything - > somePaddedData.bed

Or:

$ bar someData.bar | sort-bed - | bedmap --echo --echo-map-id - anotherDataset.bed > someDataWithOverlappingIDs.bed

For a real-world application, suppose you have GFF records and want to count how many BAM-formatted reads overlap each one. We can use BEDOPS conversion scripts to handle this analysis:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < records.gff | bedmap --echo --count - reads.bed > answer.bed

Using bigBed/bigWig or other format conversion tools that write to standard output, you can integrate their data in similar ways with BEDOPS tools.
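
For instance, a sketch with bigWig input, assuming the UCSC Kent utility bigWigToBedGraph and BEDOPS are both installed. The file signal.bw is a hypothetical placeholder, as are the regions; substitute your own data.

```shell
# Hypothetical regions to compare against; already sorted.
printf 'chr1\t100\t200\tregionA\nchr1\t500\t600\tregionB\n' > bw_regions.bed
bw=signal.bw   # hypothetical bigWig file; substitute your own

if command -v bigWigToBedGraph >/dev/null 2>&1 && command -v bedmap >/dev/null 2>&1 && [ -f "$bw" ]; then
  # bedGraph is BED-like (chrom, start, end, value). Write it to standard output,
  # sort it, and hand it to bedmap with "-" standing in for the reference file.
  # Each signal interval is printed with the number of regions it overlaps.
  bigWigToBedGraph "$bw" /dev/stdout | sort-bed - | bedmap --echo --count - bw_regions.bed
else
  echo "need bigWigToBedGraph (UCSC Kent utilities), BEDOPS, and $bw to run this step" >&2
fi
```

No intermediate BED file is written to disk here, which is the point of the piping approach described above.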

Regards,
Alex
