Hi all,
I've recently pushed a complete (though not yet thoroughly tested) version of an alternative BAM parser. Its API uses the same types as the new Sam module, and it is built on the recently added Bgzf module. I started this work because I observed a severe performance gap between our current BAM parser and htslib, and because of memory issues [1].
I wrote a simple benchmark [2] that counts alignments in a BAM file and lets you choose the implementation. I obtained very encouraging results in terms of speed.
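The benchmark's core is essentially a fold over the stream of parsed alignments. A minimal self-contained sketch of that counting loop, with the stdlib's Seq.t standing in for the parser's alignment stream (the actual benchmark uses the Biocaml stream types, not Seq):

```ocaml
(* Sketch of the counting loop: fold over a lazy stream of alignments
   and increment a counter. [Seq.t] stands in for the parser's
   alignment stream; [fake_alignments] is dummy data. *)
let count_alignments (alignments : 'a Seq.t) : int =
  Seq.fold_left (fun n _aln -> n + 1) 0 alignments

let () =
  let fake_alignments = Seq.init 287165 (fun i -> i) in
  Printf.printf "%d\n" (count_alignments fake_alignments)
```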
Here is the output on a small BAM file:
[gesundheit:~/w/biocaml 17:24]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode bam_alt reads18.bam
287165
real 0m3.847s
user 0m3.836s
sys 0m0.008s
[gesundheit:~/w/biocaml 17:25]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode transform reads18.bam
287165
real 0m48.596s
user 0m42.440s
sys 0m6.152s
[gesundheit:~/w/biocaml 17:27]$time samtools flagstat reads18.bam
287165 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
287165 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
real 0m0.175s
user 0m0.172s
sys 0m0.000s
And another one on a large typical BAM file:
[gesundheit:~/w/biocaml 19:16]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode transform reads.bam
32576694
real 89m33.159s
user 80m17.840s
sys 9m15.084s
[gesundheit:~/w/biocaml 20:47]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode bam_alt reads.bam
32576694
real 6m7.973s
user 6m6.580s
sys 0m1.380s
[gesundheit:~/w/biocaml 21:52]$time samtools flagstat reads.bam
32576694 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
8887761 + 0 mapped (27.28%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
real 0m16.932s
user 0m16.728s
sys 0m0.200s
However, memory consumption is high with both implementations on the large file: more than 2.6GB. The reason is not clear to me in either case, and I'm not sure how to investigate it. Note that samtools is an order of magnitude faster, but it probably does much less work (our parsers decode nearly everything in each alignment).
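One cheap first step for investigating the memory behaviour (nothing Biocaml-specific, just the stdlib Gc module) would be to dump GC statistics at the end of the run, to see whether the 2.6GB is live heap data or space the GC has grown but never returns to the OS:

```ocaml
(* Sketch: print OCaml GC statistics at the end of a run.
   [Gc.stat], [Sys.word_size] are stdlib; comparing [live_words]
   to [top_heap_words] distinguishes live data from heap growth. *)
let report_memory () =
  let s = Gc.stat () in
  let mb words =
    float_of_int words *. float_of_int (Sys.word_size / 8)
    /. 1024. /. 1024.
  in
  Printf.printf "live:        %.1f MB\n" (mb s.Gc.live_words);
  Printf.printf "top of heap: %.1f MB\n" (mb s.Gc.top_heap_words);
  Printf.printf "minor collections: %d\n" s.Gc.minor_collections

let () = report_memory ()
```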
Some remarks:
- to produce a fair comparison, I used [Bam.in_channel_to_item_stream_exn] for the transform implementation. Of course, with [Bam.in_channel_to_raw_item_stream_exn], the same count could be done much faster
- the Bam_alt module still lacks an unparser to be a real alternative; unfortunately I lack the time to write it right now
- the module has not been tested much, so it might not be production-ready
- a further improvement would be a partial parsing step, like [Bam.raw_alignment], but from what I observed during development, I'm not sure it would gain significantly more than a factor of 2.
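For reference, the idea behind such a partial parsing step can be sketched as keeping the undecoded record bytes and decoding individual fields lazily on demand. The type and field names below are illustrative only, not the actual [Bam.raw_alignment] interface:

```ocaml
(* Sketch of a partial-parsing representation: hold the raw record
   bytes, decode a field only when it is forced. Field offsets here
   are arbitrary illustrations, not the real BAM record layout. *)
type raw_alignment = {
  raw : bytes;             (* the alignment record, undecoded *)
  qname : string Lazy.t;   (* decoded only when forced *)
}

let make raw =
  { raw; qname = lazy (Bytes.to_string (Bytes.sub raw 0 4)) }

let () =
  let a = make (Bytes.of_string "readX...") in
  (* nothing has been decoded yet; force a field only when needed *)
  print_endline (Lazy.force a.qname)
```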