how to simplify / merge junction calls in bedpe format

948 views
Skip to first unread message

Stéphane Plaisance

unread,
Jun 21, 2011, 8:37:11 AM6/21/11
to bedtools-discuss
Hi Aaron and Co,

I have 110'000 junctions lines from a number of genomes and wish to
merge all junctions which show overlap on both ends
eg:
...
chr1 10156654 10157044 chr1 10157986
10158302 merged-junction 0 + +
chr1 10171783 10172208 chr1 10174523
10174754 merged-junction 0 + +
chr1 10171788 10172226 chr1 10174518
10174772 merged-junction 0 + +
chr1 10171844 10172208 chr1 10174446
10174754 merged-junction 0 + +
chr1 10171896 10172226 chr1 10174508
10174772 merged-junction 0 + +
...

in the example, the last 4 junctions overlap although they are not
exactly identical.

Is there a mergeBed(pe) command or can I 'fake' it somehow?
I'd like to be able to merge in a conservative way by keeping the
largest interval (10171783-10172226 // 10174446-10174772)
or in a restrictive manner keeping the minimal overlap.

Thanks in advance,
Stephane

Aaron Quinlan

unread,
Jun 21, 2011, 9:36:08 AM6/21/11
to bedtools...@googlegroups.com
Hi Stephane,

There is no merging functionality for BEDPE files. I think your best option would be to use pairToPair to find junctions that overlap on both ends (the default), and then write a script around the output to merge the coordinates. I recognize that this is a non-trivial problem, as I have faced it myself. Essentially, what you are try to do is equivalent to structural variant clustering algorithms such as VariationHunter or my own Hydra. If the former solution is inadequate, you could consider using Hydra on the BEDPE input (you'd have to add a few dummy columns) to cluster the junctions.

Best,
Aaron

Tommy Tang

unread,
Oct 12, 2015, 10:59:39 AM10/12/15
to bedtools-discuss
Hi Aaron,

I have a similar task as this.

I have structural variants calls in bedpe format. I want to find svs that with both ends overlapping.
Say, I have 50 such bedpe.
Now, my workaround is using bedops
1. split the bedpe to 2 bed files

50 beds:
chr1 start1 end1 sample_1_id1    
chr1 start1 end1 sample_2_id2               
.......

50 beds:
chr2 start2 end2 sample_1_id1
chr2 start2 end2 sample_2_id2
.......

2. merge 50 beds respectively (now I have 2 bed files only) and use bedmap

 cat merged.bed1 | bedmap --echo --echo-ref-name --echo-map-id-uniq --delim "\t" - 


 cat merged.bed2 | bedmap --echo --echo-ref-name --echo-map-id-uniq --delim "\t" - 


3. sort the 2 bed files according to the id column and then check 

any common ids were overlapped in the last column (added by bedmap)


Any simpler way to do it?

Also, in this approach I have to make sure that the same row in the bedpe format always have the 

left most breakpoint in the first block and the right most breakpoint in the second block.


Moreover, bedmap does not take strand into consideration. 

I actually have to split one bedpe to 4 bed files based on strandness (++,+-,-+,--)


I realized that it is a common task (find recurrent svs in all samples) to do when one has the sv calls. 

I am using speedseq to get the sv calls and convert them to bedpe.

what's your suggestion to do this?


Thanks very much!


Ming





Aaron Quinlan

unread,
Oct 12, 2015, 4:36:48 PM10/12/15
to bedtools...@googlegroups.com
Hi Ming,

Have you tried “bedtools pairtopair -a a.bedpe -b b.bedpe -type both”?

Aaron

--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tommy Tang

unread,
Oct 13, 2015, 11:11:08 AM10/13/15
to bedtools-discuss
Hi Aaron.

Yes, I have used pairTopair.
The problem is that I have 50 such bedpe files, pairTopair can only be used for 2.
I have to do a double loop to do all the possible intersections.

A simple question to ask is how many SVs are shared in more than 2, 3 or 4 bedpe files.
exactly the same problem for intersecting multiple bed files as you mentioned here https://www.biostars.org/p/13516/

I want to know If you have any better solutions

Thanks!
Ming
Reply all
Reply to author
Forward
0 new messages