Could you clarify what you mean by "population-wide gaps"? Are you comparing BEDGRAPH files for each sample and looking for gaps in coverage among your cohort?
I see the issue you are facing w.r.t. running three passes, but if each file is a BEDGRAPH, I think the unionBedGraphs tool will do what you want.
That said, I may have completely misinterpreted the problem. Could you give a small example of the sample files and the expected output?
Best,
Aaron
A comm-like solution is quite difficult given the way that BEDTools is currently architected. Basically, the routines are driven by the "A" file and the "B" file is merely used for screening overlaps. I am working (slowly) on new algorithms that would treat A and B as peers and thus support such a tool. I hope for this to be available by the end of the year.
In the interim, based on my understanding of your problem, I believe that unionBedGraphs would work for you. Imagine you have 3 gap files in BEDGRAPH format: 1.gap.bedg, 2.gap.bedg, and 3.gap.bedg. The intervals in each file are the _non-overlapping_ gap locations and the fourth column is a "1" for "gap present".
$ cat 1.gap.bedg
chr1 100 150 1
chr1 200 210 1
$ cat 2.gap.bedg
chr1 90 160 1
chr1 170 205 1
$ cat 3.gap.bedg
chr1 198 207 1
chr1 209 210 1
Running unionBedGraphs on these three files will return the common and unique intervals (see below). The requirements are that:
a) each bedg file must be sorted by chrom, then start (sort -k1,1 -k2,2n)
b) the intervals in each bedg must be non-overlapping. that is, you should merge your gaps in each file beforehand.
$ unionBedGraphs -i 1.gap.bedg 2.gap.bedg 3.gap.bedg -header -names 1 2 3
chrom start end 1 2 3
chr1 90 100 0 1 0
chr1 100 150 1 1 0
chr1 150 160 0 1 0
chr1 170 198 0 1 0
chr1 198 200 0 1 1
chr1 200 205 1 1 1
chr1 205 207 1 0 1
chr1 207 209 1 0 0
chr1 209 210 1 0 1
Best,
Aaron
Aaron Quinlan, Ph.D.
Assistant Professor
Center for Public Health Genomics
Department of Public Health Sciences
University of Virginia
West Complex
Box 800717, Charlottesville, VA 22908
(T) 434.243.1669 (F) 434.924.1312
cphg.virginia.edu/quinlan
ar...@virginia.edu
This is indeed an annoyance. Short of writing a custom tool or creating a small "mapping" file that define sort order (e.g. "chr1"=>1), the only thing I know that works well is to just use numeric chromosomes for all BED files. Internally, for most of my work I make chr1-ch22 be 1-22 and X,Y,M, be 23-25, etc. Then sorting becomes simple. The consequence is that you have to convert all interval files you work with.
If that is not palatable, I would create a mapping file and a small conversion script that uses the mapping file to do the sort appropriately using GNU sort on the temporarily-added sort number.
Best,
Aaron
The solution to this sorting is actually very simple:
On the command line, sort (since version 7.0) supports "version sort order" (sometimes called "natural sort") which does exactly what you need:
====
$ cat chrom.txt
chr1
chrUn_gl000232
chrY
chr2
chr13
chrM
chrUn_gl000218
chr6_hap
chr2R
chr16
chr10
chr6_dbb_hap3
chr4
chr3L
chr4_ctg9_hap1
chr3R
chr3
chrX
$ sort -k1,1V chrom.txt
chr1
chr2
chr2R
chr3
chr3L
chr3R
chr4
chr4_ctg9_hap1
chr6_dbb_hap3
chr6_hap
chr10
chr13
chr16
chrM
chrUn_gl000218
chrUn_gl000232
chrX
chrY
====
Programming wise,
I use the following code to sort in "natural order":
http://cancan.cshl.edu/labmembers/gordon/files/natsort.tar.bz2
Inside you'll find code for C's qsort and for STL's vector - adaptable for almost every situation...
Hope this helps,
-gordon
P.S.
Depending on which unix flavor you're using "sort" might be very old - you might have to download and compile the latest (8.10) from source.
Best,
Aaron