GNU-comm like tool to do multi-intersect/subtract

217 views
Skip to first unread message

Stéphane Plaisance

unread,
Mar 22, 2011, 8:35:41 AM3/22/11
to bedtools-discuss
Dear Aaron,

I am busy extracting population-wide gaps and therefore wish to
compare 100 full genomes to obtain the full gap table.

To do so, I did not find a better way than using intersectBed and
subtractBed in serial iterative manner, which works for at least 20
but is very time (and RAM) consuming and might crash for greater
numbers due to memory overflow.

What I currently do is:

==>store the first 4-col BED file in sortedA.bed

store every consecutive BED file in sortedB and

loop {
==> intersectBed -a sortedA.bed -b sortedB.bed > a_and_b.bed
==> subtractBed -a sortedA.bed -b sortedB.bed > a_only.bed
==> subtractBed -b sortedA.bed -a sortedB.bed > b_only.bed

==> merge the three result BED files and sort then into sortedA.bed
(after adding ',1' or ',0' to column 5 according to the source)
} # start over with next B until all done

the final result looks like this with 40+ million lines
chrX 0 2 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
chrX 2 3 1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1
chrX 3 4 1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1
...

Would you have any executable which could do the job in one pass
instead of my three passes (a bit like what GNU 'comm' does):
spitting 'A-only', 'commonAB' and 'B-only' out to 3 separate files as
it goes? This would do wonders for me.

Thanks in advance
Stephane

Aaron Quinlan

unread,
Mar 22, 2011, 9:30:49 AM3/22/11
to bedtools...@googlegroups.com
Hi Stephane,

Could you clarify what you mean by "population-wide gaps"? Are you comparing BEDGRAPH files for each sample and looking for gaps in coverage among your cohort?

I see the issue you are facing w.r.t. running three passes, but if each file is a BEDGRAPH, I think the unionBedGraphs tool will do what you want.

That said, I may have completely misinterpreted the problem. Could you give a small example of the sample files and the expected output?

Best,
Aaron

Stéphane Plaisance

unread,
Mar 23, 2011, 6:22:34 AM3/23/11
to bedtools-discuss
Hi Aaron,

Actually, I have a gap.bed file for each genome and want to combine
them all into one final bed file
each line in the final bed is like I shown before and you can see
which genomes are gaps at this locus (1) or not gap (0)
the name field contains an array of 0/1 for the whole genome list

Gap is one application but crossing calls would be another for
multiple genomes.
I do not know what the graph command does and will check.

Does this answer your question?

Thanks
Stephane
> > Stephane- Tekst uit oorspronkelijk bericht niet weergeven -
>
> - Tekst uit oorspronkelijk bericht weergeven -

Aaron Quinlan

unread,
Mar 24, 2011, 10:09:09 AM3/24/11
to bedtools...@googlegroups.com
Hi Stephane,

A comm-like solution is quite difficult given the way that BEDTools is currently architected. Basically, the routines are driven by the "A" file and the "B" file is merely used for screening overlaps. I am working (slowly) on new algorithms that would treat A and B as peers and thus support such a tool. I hope for this to be available by the end of the year.

In the interim, based on my understanding of your problem, I believe that unionBedGraphs would work for you. Imagine you have 3 gap files in BEDGRAPH format: 1.gap.bedg, 2.gap.bedg, and 3.gap.bedg. The intervals in each file are the _non-overlapping_ gap locations and the fourth column is a "1" for "gap present".

$ cat 1.gap.bedg
chr1 100 150 1
chr1 200 210 1

$ cat 2.gap.bedg
chr1 90 160 1
chr1 170 205 1

$ cat 3.gap.bedg
chr1 198 207 1
chr1 209 210 1

Running unionBedGraphs on these three files will return the common and unique intervals (see below). The requirements are that:

a) each bedg file must be sorted by chrom, then start (sort -k1,1 -k2,2n)
b) the intervals in each bedg must be non-overlapping. that is, you should merge your gaps in each file beforehand.

$ unionBedGraphs -i 1.gap.bedg 2.gap.bedg 3.gap.bedg -header -names 1 2 3
chrom start end 1 2 3
chr1 90 100 0 1 0
chr1 100 150 1 1 0
chr1 150 160 0 1 0
chr1 170 198 0 1 0
chr1 198 200 0 1 1
chr1 200 205 1 1 1
chr1 205 207 1 0 1
chr1 207 209 1 0 0
chr1 209 210 1 0 1

Best,
Aaron

Aaron Quinlan, Ph.D.
Assistant Professor
Center for Public Health Genomics
Department of Public Health Sciences
University of Virginia
West Complex
Box 800717, Charlottesville, VA 22908
(T) 434.243.1669 (F) 434.924.1312
cphg.virginia.edu/quinlan
ar...@virginia.edu

Stéphane Plaisance

unread,
Mar 26, 2011, 6:42:39 AM3/26/11
to bedtools-discuss
sorry to bother you with this other one, recently (not sure how it was
before) I have problems sorting bed files when 'chr' precedes the
number

'a) each bedg file must be sorted by chrom, then start (sort -k1,1 -
k2,2n)' => chr1 then chr10, 11 ... 2, 20, 21, 22, 3, 4, 5 ...

the only solution I found was the old bedsort from Kent or sort -n -
k1.4 -k2 which puts chromosomes M X and Y upfront
This is weird and more a linux than bed problem and I do not know what
to do about it, any suggestion would be welcome

Thanks
Stephane

Aaron Quinlan

unread,
Mar 26, 2011, 8:08:13 AM3/26/11
to bedtools...@googlegroups.com
Hi Stephane,

This is indeed an annoyance. Short of writing a custom tool or creating a small "mapping" file that define sort order (e.g. "chr1"=>1), the only thing I know that works well is to just use numeric chromosomes for all BED files. Internally, for most of my work I make chr1-ch22 be 1-22 and X,Y,M, be 23-25, etc. Then sorting becomes simple. The consequence is that you have to convert all interval files you work with.

If that is not palatable, I would create a mapping file and a small conversion script that uses the mapping file to do the sort appropriately using GNU sort on the temporarily-added sort number.

Best,
Aaron

Assaf Gordon

unread,
Mar 26, 2011, 8:29:33 AM3/26/11
to bedtools...@googlegroups.com, stephane....@gmail.com
Hello Stephane, Aaron,

The solution to this sorting is actually very simple:
On the command line, sort (since version 7.0) supports "version sort order" (sometimes called "natural sort") which does exactly what you need:

====
$ cat chrom.txt
chr1
chrUn_gl000232
chrY
chr2
chr13
chrM
chrUn_gl000218
chr6_hap
chr2R
chr16
chr10
chr6_dbb_hap3
chr4
chr3L
chr4_ctg9_hap1
chr3R
chr3
chrX

$ sort -k1,1V chrom.txt
chr1
chr2
chr2R
chr3
chr3L
chr3R
chr4
chr4_ctg9_hap1
chr6_dbb_hap3
chr6_hap
chr10
chr13
chr16
chrM
chrUn_gl000218
chrUn_gl000232
chrX
chrY
====


Programming wise,
I use the following code to sort in "natural order":
http://cancan.cshl.edu/labmembers/gordon/files/natsort.tar.bz2

Inside you'll find code for C's qsort and for STL's vector - adaptable for almost every situation...

Hope this helps,
-gordon

P.S.
Depending on which unix flavor you're using "sort" might be very old - you might have to download and compile the latest (8.10) from source.

Aaron Quinlan

unread,
Mar 26, 2011, 8:39:25 AM3/26/11
to bedtools...@googlegroups.com
Ha, fantastic! Time for an upgrade from 5.9...

Best,
Aaron

Reply all
Reply to author
Forward
0 new messages