Duplications after intersectBed

Joshua Ainsley

Apr 13, 2011, 9:57:39 AM
to bedtools-discuss
Hello,
A couple of searches didn't turn up a similar discussion, so I apologize if
this has been dealt with before.
I am using intersectBed to find SNPs that map to exons, but for some
reason I get more lines after intersectBed than I had in the original
file. It appears to insert duplicate entries for some of the SNPs.
Could someone offer an explanation for this?

command used:

intersectBed -a AtoG_sorted.bed -b mm9_exons.bed > AtoG_InExons.bed

wc of both files:

1077 6462 36595 AtoG_InExons.bed
860 5160 29221 AtoG_sorted.bed

head of both files:

==> AtoG_InExons.bed <==
chr1 11102906 11102907 A/G 255 +
chr1 11102906 11102907 A/G 255 +
chr1 38090936 38090937 A/G 255 +
chr1 38090940 38090941 A/G 255 +
chr1 52276802 52276803 A/G 255 +
chr1 52276802 52276803 A/G 255 +
chr1 55052274 55052275 A/G 255 +
chr1 60423509 60423510 A/G 255 +
chr1 60423509 60423510 A/G 255 +
chr1 60423509 60423510 A/G 255 +

==> AtoG_sorted.bed <==
chr1 4814670 4814671 A/G 255 +
chr1 6430872 6430873 A/G 255 +
chr1 6495691 6495692 A/G 255 +
chr1 11102906 11102907 A/G 255 +
chr1 21028114 21028115 A/G 255 +
chr1 30691293 30691294 A/G 255 +
chr1 34109619 34109620 A/G 255 +
chr1 37438464 37438465 A/G 255 +
chr1 38090936 38090937 A/G 255 +
chr1 38090940 38090941 A/G 255 +

Thanks,
Josh

Aaron Quinlan

Apr 13, 2011, 10:03:48 AM
to bedtools...@googlegroups.com
Hi Josh,

By default, intersectBed will report _each_ overlap between A and B. Thus, if a given SNP hits multiple exons (e.g., multiple transcripts for the same refGene or genes on opposite strands), it will report all of them. You can use the "-u" option to suppress this and turn the question into a "yes or no" query. Alternatively, one can use the "-c" option to count the number of overlapping exons.

Lastly, you may want to use the "-wb" option to get a sense of which exons are being reported.
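
To make the difference between these options concrete, here is a minimal sketch on a toy example. The file names snps.bed and exons.bed and all coordinates are made up for illustration (they are not taken from the mm9 data); the toy input has one SNP that falls inside two overlapping exons on opposite strands.

```shell
# Toy inputs (hypothetical): one SNP, two exons that both contain it.
printf 'chr1\t100\t101\tA/G\t255\t+\n' > snps.bed
printf 'chr1\t50\t200\texonA\t0\t+\nchr1\t90\t150\texonB\t0\t-\n' > exons.bed

if command -v intersectBed >/dev/null 2>&1; then
  # Default: one output line per overlap, so the single SNP is reported twice.
  intersectBed -a snps.bed -b exons.bed

  # -u: report each A entry at most once if it overlaps anything in B.
  intersectBed -a snps.bed -b exons.bed -u

  # -c: report each A entry once, with the number of overlapping B entries appended.
  intersectBed -a snps.bed -b exons.bed -c

  # -wb: also print the overlapping B entry, showing which exon produced each line.
  intersectBed -a snps.bed -b exons.bed -wb
else
  echo "intersectBed not found on PATH; install bedtools to run this sketch" >&2
fi
```

For the original question, adding -u to the posted command should bring the output back to at most one line per SNP in AtoG_sorted.bed.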

Best,
Aaron

Joshua Ainsley

Apr 13, 2011, 10:59:08 AM
to bedtools-discuss
Aaron,
I knew there was a simple explanation I was overlooking.
Thanks!
Josh

Amin Momin

Apr 13, 2011, 6:35:47 PM
to bedtools-discuss
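You can also use an awk script to remove the duplicates after the fact:

awk '! a[$1" "$2]++' infile.bed > outfile.bed

This keeps only the first line seen for each chromosome/start pair, which should take care of them.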
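
As a quick check, the sketch below runs that kind of filter on the ten lines shown in the head of AtoG_InExons.bed earlier in the thread. The file names infile.bed, dedup_by_pos.bed and dedup_by_line.bed are placeholders, and the second variant (keyed on the whole line) is an assumption about what may be safer if two different records ever share a chromosome and start, not something suggested in the thread.

```shell
# The ten lines from the head of AtoG_InExons.bed, written tab-separated.
# printf reuses the format string for each start/end pair.
printf 'chr1\t%s\t%s\tA/G\t255\t+\n' \
  11102906 11102907 \
  11102906 11102907 \
  38090936 38090937 \
  38090940 38090941 \
  52276802 52276803 \
  52276802 52276803 \
  55052274 55052275 \
  60423509 60423510 \
  60423509 60423510 \
  60423509 60423510 > infile.bed

# Key on chromosome + start: keeps the first line seen for each position.
awk '!seen[$1" "$2]++' infile.bed > dedup_by_pos.bed

# Key on the whole line: only drops lines that are identical in every column.
awk '!seen[$0]++' infile.bed > dedup_by_line.bed

# Compare line counts before and after.
wc -l infile.bed dedup_by_pos.bed dedup_by_line.bed
```

On this sample both variants reduce the ten lines to six, one per distinct SNP position.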

Amin