[bedtools-discuss] Bedtools: intersect more than 2 files

2,396 views
Skip to first unread message

Aaron Quinlan

unread,
Oct 26, 2011, 9:27:38 AM10/26/11
to Sigrid Laure ROUAM, bedtools...@googlegroups.com
Hi Sigrid,

I have copied my response to your inquiry to the bedtools mailing list in case others have wondered the same thing.

In short, no there is no single command that will do this for _all_ types of BED files.  The unionCoverageBed tool can be used to do this for sorted BEDGRAPH files that have non-overlapping intervals.  For example, the following illustrates an example if your interval files are non-overlapping (if not, you could first use mergeBed -n on each file):

 $ cat 1.bg
 chr1  1000    1500    10
 chr1  2000    2100    20

 $ cat 2.bg
 chr1  900 1600    60
 chr1  1700    2050    50

 $ cat 3.bg
 chr1  1980    2070    80
 chr1  2090    2100    20

 $ unionBedGraphs -i 1.bg 2.bg 3.bg
 chr1  900 1000    0   60  0
 chr1  1000    1500    10  60  0
 chr1  1500    1600    0   60  0
 chr1  1700    1980    0   50  0
 chr1  1980    2000    0   50  80
 chr1  2000    2050    20  50  80
 chr1  2050    2070    20  0   80
 chr1  2070    2090    20  0   0
 chr1  2090    2100    20  0   20

Now, I have been thinking about a general solution for multiple BED files with potentially overlapping intervals, but don't yet have a solution I feel is worth sharing.  The only other current option in BEDTools is to either create a large piped command such as:

$ intersectBed -a 1.bed -b 2.bed | intersectBed -a - -b 3.bed | intersectBed -a - -b 4.bed {ETC.]

Alternatively, if you use Python, the pybedtools library has a nice approach for this, where the "+" operator is overloaded to do interval intersection:

from pybedtools import BedTool

file1 = BedTool("1.bed")
file2 = BedTool("2.bed")
file3 = BedTool("3.bed")
file4 = BedTool("4.bed")

common_intervals = (file1 + file2 + file3 + file4)

I hope this provides some potential options.
Best,
Aaron



On Oct 25, 2011, at 1:20 AM, Sigrid Laure ROUAM wrote:

Hello,

I am currently using bedtools and am really satisfied with it. It is a great tool.
I was wondering if it is possible to intersect more than 2 files at a time. For example I would like to know what are the common regions in 70 different file.

Thank you.

Best Regards,
--
Sigrid Rouam, PhD
Post Doctoral Fellow
Computational and Mathematical Biology 4
Genome Institute of Singapore
+65 6808 8195

Ryan Dale

unread,
Oct 26, 2011, 10:39:54 AM10/26/11
to bedtools...@googlegroups.com
To extend Aaron's pybedtools example, the following will intersect together all *.bed files in a directory called "beds" and print the resulting BED file to stdout:

from pybedtools import BedTool
from glob import glob
fns = glob("beds/*.bed")
common_intervals = BedTool(fn[0])
for fn in fns[1:]:
    common_intervals += BedTool(fn)
print common_intervals


-ryan

Brent Pedersen

unread,
Oct 26, 2011, 10:43:18 AM10/26/11
to bedtools...@googlegroups.com
completely unnecessary, but just out of curiosity, Ii wonder if:

from pybedtools import BedTool
import operator

common = reduce(operator.add, (BedTool(f) for f in fns))

would work as well.

ara...@bu.edu

unread,
Jan 1, 2015, 8:31:58 PM1/1/15
to bedtools...@googlegroups.com, rou...@gis.a-star.edu.sg
Hi BEDTool users,

I just wanted to reply to this post to share how I usually get the peak intersection from 3 or more BED files of peaks.  In other words, I typically work with many samples (at least 20), each with their respective discovered peak list.  We want to know which genomic regions are overlapping with all the samples' peak list (the peak intersection among many samples).  Here's my approach, I'd welcome any feedback about whether this is a correct approach or if there maybe a more efficient method:

It's understood that for each sample 1 to sample n: there will be common peaks + unique peaks

(Step 1) Obtain the peak union (common peaks + unique peaks) from all the samples (sample 1 to sample n)
Concatenate all the BED files of peaks to a single file
Then do a bedtools merge to get the peak union

(Step 2) Obtain the peak intersection
Create a list of BED files
Use a for loop over this list of BED files to do intersectBed between the peak union and each sample specific BED file
Except that the output of each intersectBed command then gets used as the input for the next iteration in the for loop
In other words:
The 1st iteration of the loop will give sample 1 common and unique peaks
The 2nd iteration of the loop will give (sample 1 common and sample 2 common) (filtered out sample 1 unique and sample 2 unique)
The 3rd iteration of the loop will give (sample 1 common, sample 2 common, and sample 3 common) (filtered out sample 3 unique)
etc...

The end result is a peak intersection list (the max number of features in the peak intersection is equal to the sample peak list with the fewest features).

Question: Is this a good/correct approach to get the peak intersection with 3 or more BED files of peaks?

Thanks,
Andy

Aaron Quinlan

unread,
Jan 1, 2015, 8:56:33 PM1/1/15
to bedtools...@googlegroups.com, rou...@gis.a-star.edu.sg
Hi Andy,

Have you tried the (poorly documented) "multiinter" tool?  The best examples can be found in the second answer here.

Let me know if this is not what you are looking for.

Best,
Aaron

--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ara...@bu.edu

unread,
Jan 4, 2015, 10:49:52 AM1/4/15
to bedtools...@googlegroups.com, rou...@gis.a-star.edu.sg
Hi Aaron,

Thanks for the suggestion.  I just tested the multiIntersectBed and it does in fact give me the intersection regions from multiple BED files!  The example provided on the Biostars page is very helpful to understand the tool.  I do have 1 question:

How are the "intervals" determined when the user calls multiIntersectBed? From the example, it looks like each peak region is split into different parts (intervals) (sort of like doing the reverse of a mergeBed command). After the peak is split it looks like each peak section is interrogated to identify which samples contain that peak section. 

I ended up extracting the genomic regions that were present in all the samples (used an awk if statement on the 4th column).  I also found that there were very intervals (1bp, 2bp, 3bp wide) that by definition are present in all the samples but we don't consider them to be peaks (most likely flanking peak regions).  I filter these out by implementing a width minimum of 100bp.

What's nice about the output from multiIntersectBed is that it's immediately apparent that flanking peak regions (intervals) are only present in a few of the samples while the center peak region is present in all the samples.  This indicates there's some noise in the flanking peak region during peak discovery for each sample but the center of the peak is robust enough to be present in all samples.

I did some overlap analysis between the intersect that I get from the process I described previously and from
multiIntersectBed and I found 100% and 96% common peak percentages.  I would rather use the
multiIntersectBed since it yields more precise genomic regions that are common  between all samples.

Thanks for making this tool available,
Andy

Tommy Tang

unread,
Oct 12, 2015, 10:59:39 AM10/12/15
to bedtools-discuss
Hi Ryan,

If I want to do similar stuff with bedpe files. How can I use the "+" operator for pair_to_pair?
do I need to do operator overloading for it as well?

Thank,
Ming

Ryan Dale

unread,
Oct 13, 2015, 9:46:59 AM10/13/15
to bedtools...@googlegroups.com
Hi Ming -

If a and b are both BedTool objects, then using the `+` operator like this:

    a + b

is just a shortcut for this:

    a.intersect(b, u=True)

So in the example below, the line:

    common_intervals += BedTool(fn)

is doing the same thing as:

    common_intervals = common_intervals.intersect(fn, u=True)

To work on BEDPE files, you can replace `intersect` with `pair_to_pair` and add whatever keyword args you'd like. For example:

    common_intervals = common_intervals.pair_to_pair(fn, slop=100)

Hope that helps.

-ryan

Tommy Tang

unread,
Oct 13, 2015, 11:11:08 AM10/13/15
to bedtools-discuss
Thanks Ryan !

Ming
Reply all
Reply to author
Forward
0 new messages