bedtools closest: v21 performance, v22 crashes

89 views
Skip to first unread message

skat...@ucsc.edu

unread,
Nov 21, 2014, 3:19:23 PM11/21/14
to bedtools...@googlegroups.com
Dear Aaron,

I found a severe performance problem with "bedtools closest" version 2.21.0,
specifically when using "-iu" or "-id".  (also using "-s -D b -t first")

Then I saw the update to version 2.22.0 that requires sorted input. So I upgraded to that version.

Using the identical (sorted) input to the two versions, I get various problems with version 2.22.0
a) segmentation faults using iu/id
b) in one case, (faster but) different results when not using iu/id
c) in another case, an apparent hang when not using iu/id

I can send you the input bed files. Below are table summaries of my results and other supporting background.

/Sol Katzman
-------------------------------------------

I categorized the output files by examining the final output column (awk $NF), which should
represent whether the closest is
a) overlap ( =0)
b) downstream ( > 0)
c) upstream (< -1)
d) not found (= -1)  [might include some downstream cases]

The "u-time" column is from the output of a "time bedtools closest..." command

With version 2.21.0 you can see that when there are many items to be ignored,
the processing time increases by orders of magnitude.

You can also see the different results (for one comparison) between the 2 versions.

One interesting aspect of the input bed files is that the "chrom" field is actually
(an Ensembl) gene identifier. We are trying to find the closest feature within a gene locus.
In the a1/b1 files, the "start/end" fields are actually chromosome coordinates.
In the a2/b2 files, the "start/end" fields are gene transcript coordinates.

Of course, none of that should matter, but it might be the cause of the problems with version 2.22.0
(lots of distinct "chrom" values, with long names).

Here are the results of 2 runs (a1 vs b1 and a2 vs b2) using or not using id/iu:

v2.21.0
a1.b1    -1   <-1    =0   >0  total u-time
not.id    0   585  1204 1303   3092    2.4
yes.id  402  1486  1204    0   3092  486.1
yes.iu    0     0  1204 1888   3092    2.4
                        
a2.b2    -1   <-1    =0   >0  total u-time
not.id   23   640  1206  374   2243    1.9
yes.id   35  1002  1206    0   2243   13.4
yes.iu  402     0  1206  635   2243  364.2

v2.22.0
a1.b1    -1   <-1    =0   >0  total u-time
not.id    2   963  1203  924   3092    0.5
yes.id                              segFault
yes.iu                              segFault
                        
a2.b2    -1   <-1    =0   >0  total u-time
not.id                              hang (45+ mins)
yes.id                              segFault
yes.iu                              segFault

Here are a couple of typical commands, along with some other supporting information:

$ bedtools sort -i a1.bed > a1.sorted.bed
$ bedtools sort -i a2.bed > a2.sorted.bed
$ bedtools sort -i b1.bed > b1.sorted.bed
$ bedtools sort -i b2.bed > b2.sorted.bed

$ set BEDv22ROOT = $BEDTOOLSROOT/../../bedtools-2.22.0/bin
$ set BEDv21ROOT = $BEDTOOLSROOT/../../bedtools-2.21.0/bin

$ $BEDv22ROOT/bedtools closest |& grep Version
Version: v2.22.0
$ $BEDv21ROOT/bedtools closest |& grep Version
Version: v2.21.0

$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first     > v21out.a1.b1.not.id.sorted.bed
2.394u 0.248s 0:02.73 96.3%     0+0k 0+0io 0pf+0w
$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -id > v21out.a1.b1.yes.id.sorted.bed
486.116u 0.633s 8:10.75 99.1%   0+0k 0+0io 0pf+0w
$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -iu > v21out.a1.b1.yes.iu.sorted.bed
2.372u 0.257s 0:02.76 94.9%     0+0k 0+0io 0pf+0w
...
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first     > v22out.a1.b1.not.id.sorted.bed
0.546u 0.035s 0:00.58 98.2%     0+0k 0+0io 0pf+0w
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -id > v22out.a1.b1.yes.id.sorted.bed
Segmentation fault (core dumped)
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -iu > v22out.a1.b1.yes.iu.sorted.bed
Segmentation fault (core dumped)

...

$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF == -1)' | wc -l

$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF <  -1)' | wc -l
963 
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF ==  0)' | wc -l
1203 
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF >   0)' | wc -l
924
...
$ wc -l *out*.bed
   3092 v21out.a1.b1.not.id.sorted.bed
   3092 v21out.a1.b1.yes.id.sorted.bed
   3092 v21out.a1.b1.yes.iu.sorted.bed
   2243 v21out.a2.b2.not.id.sorted.bed
   2243 v21out.a2.b2.yes.id.sorted.bed
   2243 v21out.a2.b2.yes.iu.sorted.bed
   3092 v22out.a1.b1.not.id.sorted.bed


Aaron Quinlan

unread,
Nov 21, 2014, 3:47:35 PM11/21/14
to bedtools...@googlegroups.com, Kindlon, Neil (nek3d)
Hi Sol,

Thanks for reporting this. Motivated by the inherent performance limitations of the tree-based algorithm, version 2.22 uses a "sweeping" algorithm that exploits the obvious advantages of detecting "closest" intervals when datasets are sorted by chrom and start coordinate. This new version is much faster than the old. 

However, it looks like our test coverage was incomplete. We will look into fixing this issue a.s.a.p.  Thanks much for reporting it.

Aaron

--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

skat...@ucsc.edu

unread,
Nov 21, 2014, 5:32:27 PM11/21/14
to bedtools...@googlegroups.com, skat...@ucsc.edu
Dear Aaron,

While I have your attention, perhaps you could fix a couple of documentation issues.

Based on the numbers in my tables, I believe that when "-D b" is specified, then iu/id
means ignore features in B that are downstream/upstream of A, which is the opposite
of what the help says. See the zero values in my tables for "<-1" and ">0" offsets
when run with iu/id

I have not tested with "-D a" nor "-D ref", but I suspect that the true behavior is:

"When iu/id is specified, ignore features that would have a negative/positive offset."

Another minor point is that when nothing valid is found the output is not chrom "none";
rather, the numeric fields are all -1, the string fields are "."

In my experience, the output for nothing valid found is:
.       -1      -1      .       -1      .       -1

Finally, if nothing valid is found, the "-1" output for -D/-d (last column) would be better as ".", since -1 is a valid value.

In fact, if all the outputs were "." it would be fine in such cases.

/Sol.


Aaron Quinlan

unread,
Nov 21, 2014, 5:49:38 PM11/21/14
to bedtools...@googlegroups.com, skat...@ucsc.edu
We will look into it. Thanks very much.

- Aaron

skat...@ucsc.edu

unread,
Jan 15, 2015, 7:23:37 PM1/15/15
to bedtools...@googlegroups.com, skat...@ucsc.edu
Dear Aaron,

To close the loop on this, the problems are resolved with bedtools Version v2.22.1

I reran my tests, and comparing v2.21.0 with v2.22.1 and I get the tables shown below.

The performance with the new algorithm running on the pre-sorted files is radically
improved for the -id and -iu cases.

The new results are nearly identical to the old algorithm. (One case moved from the ">0" offset to the "-1" offset,
but I am not going to nitpick. I haven't looked to see which one is "correct".)

Thanks as always for the wonderful BEDTools suite!

/Sol Katzman

---------------------------------------------

v2.21.0

a1.b1    -1   <-1    =0   >0  total u-time
not.id    0   585  1204 1303   3092    2.4
yes.id  402  1486  1204    0   3092  486.1
yes.iu    0     0  1204 1888   3092    2.4
                        
a2.b2    -1   <-1    =0   >0  total u-time
not.id   23   640  1206  374   2243    1.9
yes.id   35  1002  1206    0   2243   13.4
yes.iu  402     0  1206  635   2243  364.2
                        
---------------------------------------------
v2.22.1


a1.b1    -1   <-1    =0   >0  total u-time
not.id    0   585  1204 1303   3092    1.5
yes.id  402  1486  1204    0   3092    1.6
yes.iu    0     0  1204 1888   3092    1.6

                        
a2.b2    -1   <-1    =0   >0  total u-time
not.id   24   640  1206  373   2243    1.3
yes.id   35  1002  1206    0   2243    1.3
yes.iu  402     0  1206  635   2243    1.3
                        

Aaron Quinlan

unread,
Jan 19, 2015, 9:37:20 AM1/19/15
to bedtools...@googlegroups.com, skat...@ucsc.edu
Hi Sol,

Thanks so much for following up with this and I am glad that things are working well now.  Your bug report was extremely helpful, thank you.  The next release will also include your suggestion to use a “.” instead of “-1” for NULL distances.

- Aaron 

Reply all
Reply to author
Forward
0 new messages