Progress report on alternative BAM parser.


Philippe Veber

Aug 22, 2014, 4:52:00 PM
to Biocaml
Hi all,

I've recently pushed a complete (though not yet thoroughly tested) version of an alternative BAM parser. Its API uses the same types as the new Sam module, and it is implemented on top of the recently added Bgzf module. I started this work because I had observed a severe performance gap between our current BAM parser and htslib, and because of memory issues [1].

I wrote a simple benchmark [2] that counts the alignments in a BAM file and lets you choose the implementation. I obtained very encouraging results in terms of speed.
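
Schematically, the benchmark is just an alignment-counting fold, something like this (the [Bam_alt.with_file] entry point and the stream fold below are illustrative assumptions, not the exact API):

(* Hypothetical shape of the bamcount benchmark: open a BAM file with the
   alternative parser and fold over the alignment stream. [Bam_alt.with_file]
   and the CFStream-style [Stream.fold] are assumptions, not the real API. *)
let count_alignments path =
  Bam_alt.with_file path ~f:(fun _header alignments ->
      Stream.fold alignments ~init:0 ~f:(fun n _alignment -> n + 1))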

Here is the output on a small BAM file:

[gesundheit:~/w/biocaml 17:24]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode bam_alt reads18.bam
287165
real    0m3.847s
user    0m3.836s
sys    0m0.008s
[gesundheit:~/w/biocaml 17:25]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode transform reads18.bam
287165
real    0m48.596s
user    0m42.440s
sys    0m6.152s
[gesundheit:~/w/biocaml 17:27]$time samtools flagstat reads18.bam
287165 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
287165 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

real    0m0.175s
user    0m0.172s
sys    0m0.000s

And another one on a large typical BAM file:

[gesundheit:~/w/biocaml 19:16]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode transform reads.bam
32576694
real    89m33.159s
user    80m17.840s
sys    9m15.084s
[gesundheit:~/w/biocaml 20:47]$time _build/benchmarks/biocaml_benchmarks.opt bamcount -mode bam_alt reads.bam
32576694
real    6m7.973s
user    6m6.580s
sys    0m1.380s
[gesundheit:~/w/biocaml 21:52]$time samtools flagstat reads.bam
32576694 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
8887761 + 0 mapped (27.28%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

real    0m16.932s
user    0m16.728s
sys    0m0.200s


However, memory consumption is high in both implementations on the large file: more than 2.6GB. The reason is not clear to me in either case, and I am not sure how to investigate it. Note that samtools is an order of magnitude faster, but it probably does much less work (our parsers decode nearly everything in each alignment).
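
A first thing to try might be sampling the GC's own accounting while streaming, say every million alignments (a minimal sketch; where to hook it in is left open):

(* Print the GC's view of the heap. If live_words stays flat while resident
   memory grows, the leak is likely on the C side: malloc'd state that is
   invisible to the OCaml GC. *)
let report_heap () =
  let s = Gc.stat () in
  Printf.eprintf "live: %d words, top of heap: %d words\n%!"
    s.Gc.live_words s.Gc.top_heap_words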

Some remarks:
- to produce a fair comparison, I used [Bam.in_channel_to_item_stream_exn] for the transform implementation. Of course, the same count could be obtained much faster with [Bam.in_channel_to_raw_item_stream_exn]

- the Bam_alt module still lacks an unparser (a writer) to be a real alternative; unfortunately I lack the time to write it right now

- the module has not been tested much, so it might not be production-ready

- a further improvement would be a partial parsing step, like [Bam.raw_alignment], but from what I observed during development, I am not sure it can gain significantly more than a factor of 2 (one hypothetical shape for such a type is sketched after this list)
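
To make the last remark concrete, here is one hypothetical shape for such a partially parsed alignment (illustrative only; the actual [Bam.raw_alignment] may differ):

(* Decode only the cheap fixed-width fields of the binary record eagerly;
   keep the variable-length payload as an undecoded slice to be parsed on
   demand. A sketch, not the actual Biocaml type. *)
type alignment0 = {
  ref_id : int ;     (* reference sequence id *)
  pos : int ;        (* 0-based leftmost position *)
  mapq : int ;
  flag : int ;
  rest : string ;    (* qname, cigar, seq, qual, aux fields, still encoded *)
}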

Cheers,
  ph.

Ashish Agarwal

Aug 22, 2014, 7:12:56 PM
to Biocaml
This is great, Philippe! You've made a huge improvement over the previous implementation!

We're still far from htslib, however. These are simple streaming calculations, so the large memory consumption doesn't make sense. That would be worth tracking down first, as it could also be the source of a slowdown.

It would be good to quantify the speedup attainable by doing less parsing. For the Sam module, we can go to the extreme by using Lines directly and see how that compares. I'll try to do that.
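
As a rough idea of that extreme, the baseline could be as simple as a line-counting loop with no parsing at all (standard library only; a sketch, not the Lines-based code itself):

(* Lower bound for any SAM parser: just count lines, to compare against
   both the parsers and wc -l. *)
let count_lines path =
  let ic = open_in path in
  let n = ref 0 in
  (try
     while true do
       ignore (input_line ic) ;
       incr n
     done
   with End_of_file -> ()) ;
  close_in ic ;
  !n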






Philippe Veber

Aug 25, 2014, 2:16:45 AM
to Biocaml
Hi guys,

> We're still far from htslib, however. These are simple streaming calculations, so the large memory consumption doesn't make sense. That would be worth tracking down first, as it could also be the source of a slowdown.
Found it: I was not calling [Zlib.inflate_end] often enough in the Bgzf module, resulting in a memory leak. With this fix, memory usage is stable at 15MB while parsing a regular BAM file. Timings are unchanged, though.
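
For the record, the fix is essentially a pairing discipline; a minimal sketch of the idea, assuming camlzip's low-level Zlib API (the wrapper itself is illustrative, not the actual Bgzf code):

(* A Zlib.stream wraps C-allocated inflate state that the OCaml GC never
   frees, so every Zlib.inflate_init must be matched by a Zlib.inflate_end,
   including on error paths. *)
let with_inflate_stream f =
  let zs = Zlib.inflate_init false in  (* false: raw deflate, as in BGZF blocks *)
  try
    let result = f zs in
    Zlib.inflate_end zs ;
    result
  with e ->
    Zlib.inflate_end zs ;
    raise e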

 

> It would be good to quantify the speedup attainable by doing less parsing. For the Sam module, we can go to the extreme by using Lines directly and see how that compares. I'll try to do that.
I tried the analogue for BAM (reading one block at a time, without looking into it) and obtained a 6x speedup. But even this maximally lazy version is still 4x slower than samtools at scanning a BAM file.
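
For context, the BGZF framing is what makes such a lazy scan cheap: each block header announces the block's compressed size. A self-contained sketch that counts blocks without inflating anything (standard library only; not the actual benchmark code):

(* Each BGZF block is a gzip member whose extra field carries BSIZE =
   (total block size - 1), so a scanner can hop from header to header.
   Assumes the usual layout where BC is the only extra subfield: 10-byte
   gzip header, 2-byte XLEN, then SI1 SI2 SLEN BSIZE, i.e. BSIZE sits at
   offsets 16-17. *)
let count_bgzf_blocks path =
  let ic = open_in_bin path in
  let header = Bytes.create 18 in
  let n = ref 0 in
  (try
     while true do
       really_input ic header 0 18 ;
       let bsize =                       (* little-endian u16 at offset 16 *)
         Char.code (Bytes.get header 16)
         lor (Char.code (Bytes.get header 17) lsl 8)
       in
       seek_in ic (pos_in ic + bsize + 1 - 18) ;  (* skip the rest of the block *)
       incr n
     done
   with End_of_file -> ()) ;
  close_in ic ;
  !n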
 

Ashish Agarwal

Aug 25, 2014, 11:02:36 AM
to Biocaml
On Mon, Aug 25, 2014 at 2:16 AM, Philippe Veber <philipp...@gmail.com> wrote:

> Found it: I was not calling [Zlib.inflate_end] often enough in the Bgzf module, resulting in a memory leak. With this fix, memory usage is stable at 15MB while parsing a regular BAM file.

Great!


> For the Sam module, we can go to the extreme by using Lines directly and see how that compares. I'll try to do that.
> I tried the analogue for BAM (reading one block at a time, without looking into it) and obtained a 6x speedup.

Okay, so that argues for including rawer, less fully parsed types. This should be easy enough to do given the type definitions in Sam: they are broken down quite a bit, so we can pick and choose which parts to include in rawer items.

The easiest way to structure this will be to provide the lower-level types and converters between them. For example:

type item0 = (* something just beyond a raw line *)
type item1 = (* maybe a little more parsing *)
type header = (* current definition *)
type alignment = (* current definition *)
type item = [`Header of header | `Alignment of alignment]


Then we provide:

val item0_to_item1 : item0 -> item1 Or_error.t
val item1_to_item : item1 -> item Or_error.t

It'll take some playing around to decide on the rawer types, but they will be trivial to implement.

Note the above is similar to what was done in the Transform-based parsers, but now the converters are normal functions.
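
To illustrate, one hypothetical instantiation of the rawer types (not a proposal for the final definitions):

(* item0: just the tab-separated fields of a line, unvalidated.
   item1: the cheap mandatory fields parsed, the rest left raw.
   Both are sketches; the real definitions would follow Sam's types. *)
type item0 = [ `Comment of string | `Fields of string array ]
type item1 = {
  qname : string ;
  flag : int ;
  rname : string ;
  pos : int ;
  mapq : int ;
  raw_rest : string array ;  (* cigar, rnext, pnext, tlen, seq, qual, optional fields *)
}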


> But even this maximally lazy version is still 4x slower than samtools at scanning a BAM file.

So we're hitting lower-level issues at this point. In the Sam case, we already saw that simply counting lines is much slower than wc, for example, and varied a lot between different OCaml implementations [1]. I'm quite happy with the results you have; these lower-level optimizations can be done later. I'd recommend renaming the current Bam to Bam_deprecated and your new version to Bam.




Philippe Veber

Aug 26, 2014, 1:38:28 AM
to Biocaml
2014-08-25 17:02 GMT+02:00 Ashish Agarwal <agarw...@gmail.com>:
> The easiest way to structure this will be to provide the lower-level types and converters between them. [...] It'll take some playing around to decide on the rawer types, but they will be trivial to implement.
I'm OK with this general plan, but I don't have enough time right now to do it. Shall we use GitHub issues as a TODO list?

> I'd recommend renaming the current Bam to Bam_deprecated and your new version to Bam.
I think it may be a bit early for that, because the new module does not implement the write part yet. It may not be ready before I come to NYC, but I can certainly finish it while I'm there.

Ashish Agarwal

Aug 26, 2014, 6:14:14 AM
to Biocaml
On Tue, Aug 26, 2014 at 1:38 AM, Philippe Veber wrote:

> I'm OK with this general plan, but I don't have enough time right now to do it. Shall we use GitHub issues as a TODO list?

Sure.


> I think it may be a bit early for that, because the new module does not implement the write part yet.

Ok, so let's wait.
 