Filtering multiple samples with intersectbed

3,457 views
Skip to first unread message

Øyvind Busk

unread,
Sep 27, 2011, 8:27:44 AM9/27/11
to bedtools-discuss
Hi,

When working with multiple samples, you would sometimes want to report
instances that are present in e.g. 7 of a total of 10 samples. Is
there an easy way of doing this with BEDtools, or do I need to script
it?

PS: thanks for a great set of tools.

Øyvind

Aaron Quinlan

unread,
Sep 27, 2011, 8:59:37 AM9/27/11
to bedtools...@googlegroups.com
Dear Øyvind,

Could you please elaborate a bit on exactly what you are trying to accomplish? For example, are your "samples" represented in distinct BED files? What defines shared presence? A single overlapping interval between two files?

It would be useful if you could provide a toy example of your input data and your desired result.

Thanks,
Aaron

Øyvind Busk

unread,
Sep 27, 2011, 9:20:42 AM9/27/11
to bedtools-discuss
Dear Aaron,
Thanks for quick reply.

A scenario:
We want to sequence the exome of e.g. 4 affected patients in the same
family all suffering from the same disease, and identify related snps.
If I want to look for a mutation present in all four samples,
I would just use intersectbed and pipe. If, however we want to find
snps that are present in AT LEAST three out of four patients, then
what do we do?

We work with snps.txt-files from CASAVA, which I process into .BED-
files. In the example data below, the interval marked with ** is
present in 4 of 4 samples, and would be reported by using intersectbed
on all four samples.
The interval marked with * is only present in three of four, and would
not be reported by using intersectbed on all four samples. So is there
an easy way of intersecting these files,
and having i report which contains intervals present in 4 and 3 of the
files?

1.BED
chr1 99 100 *
chr2 87 88
chr3 49 50 **

2.BED
chr1 99 100 *
chr2 46 47
chr3 49 50 **

3.BED
chr1 99 100 *
chr9 55 56
chr3 49 50 **

4.BED
chr1 149 150
chr10 50 51
chr3 49 50 **

Aaron Quinlan

unread,
Sep 27, 2011, 9:39:23 AM9/27/11
to bedtools...@googlegroups.com
Hi Øyvind,

I should start by saying that this problem is well suited to the VCF format, which, for each SNP, will record a genotype for each sample. This allows one to compute frequencies rather easily.
That said, if you are bound to BED files for each sample, I believe the unionBedGraphs tool can tackle this. Below I have cpnverted your BED files to BEDGRAPH format by simply adding a "1" as the fourth column of each entry in each file. UnionBedGraphs will find all common and unique intervals (in your case, SNPs) among multiple BEDGRAPH files and report the "depth" (in your case always a 1 or 0 for presence or absence) of each sample at each discrete interval.

See below for an example using your data. Please NOTE that unionBedGraphs expects proper BEDGRAPH format which requires that your input files are sorted by chrom/start and that there are NO overlapping intervals.

$ cat 1.bed
chr1 99 100


chr2 87 88
chr3 49 50

$ cat 2.bed
chr1 99 100


chr2 46 47
chr3 49 50

$ cat 3.bed
chr1 99 100


chr9 55 56
chr3 49 50

$ cat 4.bed


chr1 149 150
chr10 50 51
chr3 49 50

# convert to BEDGRAPH by chrom/pos sorting and adding a "1" to reflect presence of the snp in each sample
$ sort -k1,1 -k2,2n 1.bed | awk '{print $0"\t"1}' > 1.bedg
$ sort -k1,1 -k2,2n 2.bed | awk '{print $0"\t"1}' > 2.bedg
$ sort -k1,1 -k2,2n 3.bed | awk '{print $0"\t"1}' > 3.bedg
$ sort -k1,1 -k2,2n 4.bed | awk '{print $0"\t"1}' > 4.bedg

$ cat 1.bedg
chr1 99 100 1
chr2 87 88 1
chr3 49 50 1
$ cat 2.bedg
chr1 99 100 1
chr2 46 47 1
chr3 49 50 1
$ cat 3.bedg
chr1 99 100 1
chr3 49 50 1
chr9 55 56 1
$ cat 4.bedg
chr1 149 150 1
chr10 50 51 1
chr3 49 50 1

$ unionBedGraphs -i 1.bedg 2.bedg 3.bedg 4.bedg -names s1 s2 s3 s4 -header
chrom start end s1 s2 s3 s4
chr1 99 100 1 1 1 0
chr1 149 150 0 0 0 1
chr10 50 51 0 0 0 1
chr2 46 47 0 1 0 0
chr2 87 88 1 0 0 0
chr3 49 50 1 1 1 1
chr9 55 56 0 0 1 0

Reply all
Reply to author
Forward
0 new messages