Dear Aaron,
I found a severe performance problem with "bedtools closest" version 2.21.0,
specifically when using "-iu" or "-id" (together with "-s -D b -t first").
I then saw that version 2.22.0 requires sorted input, so I upgraded to that version.
Using identical (sorted) input for both versions, I get several problems with version 2.22.0:
a) segmentation faults when using -iu/-id
b) in one case, (faster but) different results when not using -iu/-id
c) in another case, an apparent hang when not using -iu/-id
I can send you the input bed files. Below are summary tables of my results and some supporting background.
/Sol Katzman
-------------------------------------------
I categorized the output files by examining the final output column (awk $NF), which should
represent whether the closest feature is
a) overlap (= 0)
b) downstream (> 0)
c) upstream (< -1)
d) not found (= -1) [might include some downstream cases]
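A minimal sketch of how these four categories can be tallied in a single pass, rather than with one gawk command per category. The function name tally_closest is hypothetical, the sample records are made up (not taken from my files), and this is written for POSIX sh rather than the csh used in the transcript below.

```shell
#!/bin/sh
# tally_closest (hypothetical helper): classify the final column of
# "bedtools closest -D ..." output into the four categories above.
# Prints one tab-separated row in the order: -1  <-1  =0  >0  total
tally_closest() {
    awk '
        NF == 0   { next }                      # skip blank lines
                  { total++ }
        $NF == -1 { notfound++;   next }        # d) not found
        $NF <  -1 { upstream++;   next }        # c) upstream
        $NF ==  0 { overlap++;    next }        # a) overlap
        $NF >   0 { downstream++ }              # b) downstream
        END {
            printf "%d\t%d\t%d\t%d\t%d\n",
                   notfound, upstream, overlap, downstream, total
        }
    ' "$@"
}

# Made-up four-record sample; only the last column matters here.
printf 'g1\t10\t20\t-1\ng1\t30\t40\t-57\ng2\t5\t9\t0\ng2\t50\t60\t12\n' | tally_closest
```

On a real file this would be run as "tally_closest v22out.a1.b1.not.id.sorted.bed", and the row should agree with the separate gawk counts shown further down.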
The "u-time" column is the user CPU time in seconds reported by a "time bedtools closest..." command.
With version 2.21.0 you can see that when there are many items to be ignored,
the processing time increases by about two orders of magnitude (for example, from 2.4 to 486.1 for a1.b1 with -id).
You can also see that the two versions give different results for the one comparison that completed in both (a1.b1 without -id/-iu).
One interesting aspect of the input bed files is that the "chrom" field is actually
(an Ensembl) gene identifier. We are trying to find the closest feature within a gene locus.
In the a1/b1 files, the "start/end" fields are actually chromosome coordinates.
In the a2/b2 files, the "start/end" fields are gene transcript coordinates.
Of course, none of that should matter, but it might be the cause of the problems with version 2.22.0
(lots of distinct "chrom" values, with long names).
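To put numbers on "lots of distinct chrom values, with long names", something like the following could be run on each input file. This is only a sketch: chrom_stats is a hypothetical helper name, the sample identifiers are made up, and it is written for POSIX sh.

```shell
#!/bin/sh
# chrom_stats (hypothetical helper): print the number of distinct values
# in column 1 and the length of the longest one, tab-separated.
chrom_stats() {
    awk '
        NF == 0 { next }                        # skip blank lines
        !($1 in seen) {
            seen[$1] = 1
            distinct++
            if (length($1) > maxlen) maxlen = length($1)
        }
        END { printf "%d\t%d\n", distinct, maxlen }
    ' "$@"
}

# Made-up sample with two distinct gene-style identifiers in column 1.
printf 'geneAAA\t1\t5\ngeneAAA\t7\t9\ngeneBBBBBB\t2\t4\n' | chrom_stats
```

On the real data this would be run as, for example, "chrom_stats a1.sorted.bed".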
Here are the results of the two comparisons (a1 vs b1 and a2 vs b2), with and without -id/-iu:
v2.21.0
a1.b1     -1   <-1    =0    >0  total  u-time
not.id     0   585  1204  1303   3092     2.4
yes.id   402  1486  1204     0   3092   486.1
yes.iu     0     0  1204  1888   3092     2.4
a2.b2     -1   <-1    =0    >0  total  u-time
not.id    23   640  1206   374   2243     1.9
yes.id    35  1002  1206     0   2243    13.4
yes.iu   402     0  1206   635   2243   364.2
v2.22.0
a1.b1     -1   <-1    =0    >0  total  u-time
not.id     2   963  1203   924   3092     0.5
yes.id  segFault
yes.iu  segFault
a2.b2     -1   <-1    =0    >0  total  u-time
not.id  hang (45+ mins)
yes.id  segFault
yes.iu  segFault
Here are a couple of typical commands, along with some other supporting information:
$ bedtools sort -i a1.bed > a1.sorted.bed
$ bedtools sort -i a2.bed > a2.sorted.bed
$ bedtools sort -i b1.bed > b1.sorted.bed
$ bedtools sort -i b2.bed > b2.sorted.bed
$ set BEDv22ROOT = $BEDTOOLSROOT/../../bedtools-2.22.0/bin
$ set BEDv21ROOT = $BEDTOOLSROOT/../../bedtools-2.21.0/bin
$ $BEDv22ROOT/bedtools closest |& grep Version
Version: v2.22.0
$ $BEDv21ROOT/bedtools closest |& grep Version
Version: v2.21.0
$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first > v21out.a1.b1.not.id.sorted.bed
2.394u 0.248s 0:02.73 96.3% 0+0k 0+0io 0pf+0w
$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -id > v21out.a1.b1.yes.id.sorted.bed
486.116u 0.633s 8:10.75 99.1% 0+0k 0+0io 0pf+0w
$ time $BEDv21ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -iu > v21out.a1.b1.yes.iu.sorted.bed
2.372u 0.257s 0:02.76 94.9% 0+0k 0+0io 0pf+0w
...
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first > v22out.a1.b1.not.id.sorted.bed
0.546u 0.035s 0:00.58 98.2% 0+0k 0+0io 0pf+0w
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -id > v22out.a1.b1.yes.id.sorted.bed
Segmentation fault (core dumped)
$ time $BEDv22ROOT/bedtools closest -a a1.sorted.bed -b b1.sorted.bed -s -D b -t first -iu > v22out.a1.b1.yes.iu.sorted.bed
Segmentation fault (core dumped)
...
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF == -1)' | wc -l
2
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF < -1)' | wc -l
963
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF == 0)' | wc -l
1203
$ cat v22out.a1.b1.not.id.sorted.bed | gawk '($NF > 0)' | wc -l
924
...
$ wc -l *out*.bed
3092 v21out.a1.b1.not.id.sorted.bed
3092 v21out.a1.b1.yes.id.sorted.bed
3092 v21out.a1.b1.yes.iu.sorted.bed
2243 v21out.a2.b2.not.id.sorted.bed
2243 v21out.a2.b2.yes.id.sorted.bed
2243 v21out.a2.b2.yes.iu.sorted.bed
3092 v22out.a1.b1.not.id.sorted.bed
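Since version 2.22.0 requires sorted input, an independent check that the files really are ordered could help rule that out as a cause. The sketch below assumes the expected order is column 1 in plain byte order and then column 2 numerically ascending, which is my understanding of what "bedtools sort" produces by default; is_bed_sorted is a hypothetical helper name, the sample records are made up, and it is written for POSIX sh.

```shell
#!/bin/sh
# is_bed_sorted (hypothetical helper): exit status 0 if the input is ordered
# by column 1 (byte order, forced by LC_ALL=C and string comparison) and then
# by column 2 (numeric); otherwise report the first offending line and exit 1.
is_bed_sorted() {
    LC_ALL=C awk -F '\t' '
        NR > 1 {
            chrom = $1 ""                       # force string comparison
            if (chrom < prev_chrom ||
                (chrom == prev_chrom && $2 + 0 < prev_start)) {
                printf "out of order at line %d: %s\n", NR, $0
                bad = 1
                exit
            }
        }
        { prev_chrom = $1 ""; prev_start = $2 + 0 }
        END { exit bad }
    ' "$@"
}

# Made-up samples: the first is in order, the second is not.
printf 'g1\t5\t9\ng1\t20\t30\ng2\t1\t4\n' | is_bed_sorted && echo "sorted"
printf 'g1\t20\t30\ng1\t5\t9\n' | is_bed_sorted || echo "not sorted"
```

On the real data this would be run as, for example, "is_bed_sorted a1.sorted.bed".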