[bedtools-discuss] Version 2.13.2

Aaron Quinlan

Sep 23, 2011, 12:44:20 PM

to bedtools...@googlegroups.com

Hello all,

I just posted version 2.13.2 on the Google Code site: http://code.google.com/p/bedtools/downloads/list.

(1). New -sorted option for intersectBed. I implemented a new algorithm (I call it the "chromsweep" algorithm) for detecting overlaps between files that are sorted by chrom and start position (reference-ordered in GATK parlance). "Chromsweep" has some important advantages for the long-term survival of BEDTools as datasets continue to grow in size. It marches from the beginning to the end of each chromosome and reports overlaps on the fly. In so doing, it requires essentially zero memory. This is in contrast to the standard BEDTools algorithm, which loads the "B" file into a tree structure that stays in memory. In addition, it is currently (with minimal optimization) 30% faster than the default algorithm, especially for very large files. Lastly, as most genomics files are already sorted in this way by the time users work with bedtools, the cost of sorting is amortized over the number of downstream analyses.

Currently, the -sorted option only works with reference-ordered BED/GFF/VCF files. BAM support will come soon. Over the coming months, I will integrate this algorithm into more of the tools. If your datasets are "normal" in size (i.e., <10 million features), the default algorithms are typically just fine. However, if you have tons of features with many columns, the default algorithm can use a substantial amount of memory. In such cases, the -sorted option is a much better choice. One must ensure that both files are sorted in exactly the same way, e.g.:

sort -k1,1 -k2,2n A.bed > A.sorted.bed

sort -k1,1 -k2,2n B.bed > B.sorted.bed

intersectBed -a A.sorted.bed -b B.sorted.bed -sorted [OPTIONS] > out
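
For anyone who wants to sanity-check a file before using -sorted, here is a rough Python sketch (not part of BEDTools) that tests whether a tab-delimited BED-like file is ordered by chrom and then numerically by start. The function name is_reference_sorted is made up for illustration, and the sketch assumes chromosome names compare byte-wise, as they would with sort under LC_ALL=C.

```python
from typing import Iterable


def is_reference_sorted(lines: Iterable[str]) -> bool:
    """Return True if records are ordered by (chrom, numeric start).

    Assumes tab-delimited records with chrom in column 1 and start in
    column 2, and byte-wise chromosome ordering (as with LC_ALL=C sort).
    """
    previous = None
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


if __name__ == "__main__":
    import sys

    # Usage: python check_sorted.py A.sorted.bed
    with open(sys.argv[1]) as handle:
        print(is_reference_sorted(handle))
```

Both the A and B files would need to pass the same check, since the sweep relies on the two files sharing one ordering.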

For those who are interested, I plan to write up an informal document about how the algorithm works. Many of you likely already have a detailed sense of how to do this. The difficulty is that there are some slightly tricky corner cases, given that features vary greatly in size and can be "nested" within one another. There is an easier-to-follow Python implementation here:

https://github.com/arq5x/chrom_sweep/blob/master/chrom_sweep.py
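
To give a flavor of the idea, here is a minimal sketch of a sweep over two sorted inputs. It is not the code in BEDTools or in the repository linked above; the function name sweep_overlaps and the tuple layout are assumptions made for illustration. It assumes both inputs are sorted by chrom and then numerically by start, with BED-style 0-based, half-open coordinates. The small cache of "still active" B features is what handles nested and variably sized features.

```python
from typing import Iterable, Iterator, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def sweep_overlaps(
    a_sorted: Iterable[Interval], b_sorted: Iterable[Interval]
) -> Iterator[Tuple[Interval, Interval]]:
    """Yield (a, b) pairs that overlap, reading each input only once."""
    b_iter = iter(b_sorted)
    pending = next(b_iter, None)  # next B feature not yet examined
    cache: List[Interval] = []    # B features that may still overlap an A

    for a in a_sorted:
        a_chrom, a_start, a_end = a

        # Drop cached B features that can no longer overlap this or any
        # later A: wrong chromosome, or they end at or before a_start.
        cache = [b for b in cache if b[0] == a_chrom and b[2] > a_start]

        # Pull in B features that start before this A ends.
        while pending is not None and (
            pending[0] < a_chrom
            or (pending[0] == a_chrom and pending[1] < a_end)
        ):
            if pending[0] == a_chrom and pending[2] > a_start:
                cache.append(pending)
            pending = next(b_iter, None)

        # A cached B may start after a short (nested) A ends, so re-check.
        for b in cache:
            if b[1] < a_end:
                yield a, b


if __name__ == "__main__":
    a_features = [("chr1", 0, 100), ("chr1", 10, 20), ("chr2", 5, 50)]
    b_features = [("chr1", 5, 15), ("chr1", 12, 300), ("chr2", 0, 10)]
    for a, b in sweep_overlaps(a_features, b_features):
        print(a, b)
```

Memory use is bounded by the number of B features that are "open" at the current position, rather than by the size of the B file.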

(2). genomeCoverageBed no longer needs a "genome" file for BAM input. It instead uses the BAM header, as it should have from the beginning.
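
To illustrate why the header is sufficient: the @SQ lines of a SAM/BAM header carry each reference name (SN) and length (LN), which is the same information a "genome" file holds. Below is a small Python sketch, not the BEDTools implementation, that extracts those sizes from header text such as the output of samtools view -H. The function name chrom_sizes_from_header is hypothetical.

```python
from typing import Dict


def chrom_sizes_from_header(header_text: str) -> Dict[str, int]:
    """Map reference name -> length using the @SQ lines of a SAM header."""
    sizes: Dict[str, int] = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        sizes[tags["SN"]] = int(tags["LN"])
    return sizes


if __name__ == "__main__":
    import sys

    # Usage: samtools view -H aln.bam | python header_sizes.py
    for name, length in chrom_sizes_from_header(sys.stdin.read()).items():
        print(f"{name}\t{length}")
```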

(3). New -scores option in tagBam. Just like the -names option, this allows one to populate BAM tags based on values in the BED "score" field. One potentially important use for this is to assign indicative colors to subsets of BAM alignments by placing an RGB color value in the score field. By using the "YC" tag, one could then visualize these alignments in UCSC and IGV, which each recognize the "YC" tag. For example:

head annotations.bed

chr1 0 100 exon 255,0,0

chr1 1000 1100 microRNA 0,255,0

tagBam -i aln.bam -files annotations.bed -scores > aln.colorized.bam

samtools view aln.colorized.bam

read1 chr1 10 255 ... YC:Z:255,0,0

read1 chr1 1050 255 ... YC:Z:0,255,0

Load into a browser and enjoy. Especially useful when zoomed out and overlaps are hard to see without color coding.
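
If you want to check the tags outside of a browser, here is a small Python sketch that pulls the "YC" color out of one line of samtools view output. It assumes standard tab-delimited SAM text, where optional TAG:TYPE:VALUE fields begin at column 12; the function name yc_color is made up for illustration.

```python
from typing import Optional, Tuple


def yc_color(sam_line: str) -> Optional[Tuple[int, int, int]]:
    """Return the (R, G, B) value of a YC:Z tag, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional tags follow.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if tag == "YC" and value_type == "Z":
            red, green, blue = (int(part) for part in value.split(","))
            return (red, green, blue)
    return None


if __name__ == "__main__":
    import sys

    # Usage: samtools view aln.colorized.bam | python yc_color.py
    for line in sys.stdin:
        print(line.split("\t", 1)[0], yc_color(line))
```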

Best,

Aaron

Aaron Quinlan

Sep 23, 2011, 12:54:33 PM

to bedtools...@googlegroups.com

Forgot to mention it, but I also fixed a bug that caused spurious overlaps with "split" BAM alignments.

Aaron

Aaron Quinlan

Sep 23, 2011, 1:16:56 PM

to bedtools...@googlegroups.com

Two more notes:

1. The IGV Early Access version supports YC tags for coloring. It had supported this in the past as well, but it was mistakenly dropped for a few versions.

2. The tagBam example should also have included the following option, so that it populates a "YC" tag instead of a "YB" tag: -tag YC