[bedtools-discuss] Duplicates

Adrian Johnson

Apr 21, 2010, 11:55:32 AM
to bedtools...@googlegroups.com
Hi Aaron:

Is it possible to remove duplicates from a BED file using BEDTools? I am
looking for a little clean-up function.

thanks
Adrian


Aaron Quinlan

Apr 21, 2010, 12:09:11 PM
to bedtools...@googlegroups.com
This depends on how you define duplicates.

If you want to remove entries where _every_ column is identical, you could do this with GNU sort and uniq as follows:
$ sort <BED> | uniq > BED.noDuplicates

If you want to remove entries where the coordinates are the same but, for example, the names are different, things are a bit more difficult. You would have to write a little Perl/Python/Awk script to do this or, if you know that distinct features do not overlap one another, you could use mergeBed, which would collapse each set of identical intervals into a single interval.
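
A minimal sketch of the Awk route, assuming a tab-delimited BED file and that it is acceptable to keep whichever record appears first for each chrom/start/end. The file names demo.bed and demo.noDuplicates.bed and the feature names are made up for illustration:

```shell
# Hypothetical demo input: the first two records share chrom/start/end but differ in name.
printf 'chr1\t100\t200\tfeatA\t0\t+\nchr1\t100\t200\tfeatB\t0\t+\nchr1\t300\t400\tfeatC\t0\t-\n' > demo.bed

# Print a line only the first time its chrom/start/end key is seen.
# seen[key]++ evaluates to 0 on first sight, so !seen[key]++ is true exactly once per key.
awk -F'\t' '!seen[$1 FS $2 FS $3]++' demo.bed > demo.noDuplicates.bed

cat demo.noDuplicates.bed
```

Unlike the sort-based approach, this preserves the input order and does not require sorted input, at the cost of holding one key per distinct interval in memory.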

Aaron

Gordon, Assaf

Apr 21, 2010, 12:16:24 PM
to bedtools...@googlegroups.com
If I may suggest a similar solution:

GNU sort has a '-u' option to output only unique values for the
columns used for sorting (without the need to pipe through the uniq
program).

Example:
to output only unique lines based on chrom/start/end (columns 1/2/3),
run:

sort -k1,1 -k2,2n -k3,3n -u

If you want to include the strand as well, add column 6 to the sort key:

sort -k1,1 -k2,2n -k3,3n -k6,6 -u

Other columns that are not used as sort keys (e.g. the name in column 4)
will not affect uniqueness: if two lines have the same coordinates but
different names, only one of them will be printed.
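
To see how the choice of keys changes the result, here is a small sketch on a made-up three-line file. The file names (sortdemo.bed and so on) and feature names are hypothetical, and the input is assumed to be tab-delimited:

```shell
# Hypothetical demo input: the first two records share coordinates but differ in name and strand.
printf 'chr1\t100\t200\tfeatA\t0\t+\nchr1\t100\t200\tfeatB\t0\t-\nchr1\t300\t400\tfeatC\t0\t+\n' > sortdemo.bed

# Unique on chrom/start/end only: the two chr1:100-200 records collapse into one.
sort -k1,1 -k2,2n -k3,3n -u sortdemo.bed > sortdemo.coords.bed

# Unique on chrom/start/end/strand: the + and - records are both kept.
sort -k1,1 -k2,2n -k3,3n -k6,6 -u sortdemo.bed > sortdemo.stranded.bed

# Line counts of the two results.
wc -l < sortdemo.coords.bed
wc -l < sortdemo.stranded.bed
```

The coordinate-only run should leave two lines and the strand-aware run three. If it matters which of the duplicate records survives, the Awk approach gives more explicit control.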

-gordon


Aaron Quinlan

Apr 21, 2010, 12:22:55 PM
to bedtools...@googlegroups.com
Excellent, thanks Gordon!

Aaron

Adrian Johnson

Apr 21, 2010, 12:30:00 PM
to bedtools...@googlegroups.com
Thanks Gordon and Aaron. Works super!
-Adrian