Memory usage with transforms and streams

Philippe Veber

Nov 27, 2013, 7:41:59 AM
to Biocaml
Hi everyone,

I wrote a program that scans a BAM file and noticed that it consumes a lot of memory, although I see no reason why it should. I simplified it to the following:

open Core.Std
open CFStream

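let () =
  let open Biocaml in
  let open Sam.Flags in
  (* Count the alignments that are not flagged as secondary. *)
  let update accu = function
    | `alignment { Sam.flags = al } ->
      if secondary_alignment al then accu
      else accu + 1
    | _ -> accu (* non-alignment items are ignored *)
  in
  (* Fold over the BAM items as a stream; only the counter is kept. *)
  In_channel.with_file Sys.argv.(1) ~f:(fun ic ->
    Bam.in_channel_to_item_stream_exn ic
    |> Stream.fold ~init:0 ~f:update
  )
  |> print_int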

What upsets me most is that the memory footprint grows with the size of the input file, whereas I'd expect this program to run in constant memory. If I add two calls to [Gc.major] to the [update] function, then I obtain the expected memory behavior, so I'm not claiming there's a memory leak.

Have you guys met this problem before? I'm not sure whether this is related to transforms alone, to their interaction with streams, or to something else. I guess I could tweak the GC parameters, but maybe there is a more elegant way to fix the issue. In particular, this style should be supported out-of-the-box IMHO.
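
For reference, tweaking the GC parameters could look roughly like the sketch below. It uses only the standard-library Gc module (not Core's wrapper); [with_space_overhead] is a hypothetical helper name, the value 40 is an arbitrary assumption rather than a recommendation, and [count_bytes] is just a stand-in workload.

```ocaml
(* Sketch: run [f] with a more aggressive major GC, then restore the
   previous setting. Standard-library Gc only; the helper name and the
   value 40 used below are illustrative assumptions. *)
let with_space_overhead overhead f =
  let saved = Gc.get () in
  Gc.set { saved with Gc.space_overhead = overhead };
  Fun.protect f ~finally:(fun () -> Gc.set saved)

(* Stand-in workload that allocates many short-lived strings. *)
let count_bytes n =
  let total = ref 0 in
  for i = 1 to n do
    total := !total + String.length (String.make (1 + i mod 1024) 'x')
  done;
  !total

let () =
  let result = with_space_overhead 40 (fun () -> count_bytes 100_000) in
  let heap_kib =
    (Gc.quick_stat ()).Gc.heap_words * (Sys.word_size / 8) / 1024
  in
  Printf.printf "result = %d, major heap = %d KiB\n" result heap_kib
```

A lower [space_overhead] makes the major GC work harder to keep the heap small, at the cost of CPU time, so whether it helps here would have to be measured.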

ph.

Sebastien Mondet

Nov 27, 2013, 7:58:32 AM
to bio...@googlegroups.com

Hi

That's interesting :)

- Does the memory increase affect speed? or goes out of mem?
- You said *two* calls to `Gc.major` ? what happens with only one? or with `Gc.minor`?
- Can you try to convert the Bam to a Sam and run the same thing on the Sam file?  (Because there is some weird low-level error-prone buffering in `Biocaml_zip`)

I'll also look into it.

Thanks!
Seb








Philippe Veber

Nov 27, 2013, 10:12:57 AM
to Biocaml
Hi Sébastien!



> - Does the memory increase affect speed?
I'm not sure what you mean... are you asking whether the program makes the system swap?

> or goes out of mem?
No, but I noticed that if I call another program via Sys.command after having taken all that memory, the other program fails.

> - You said *two* calls to `Gc.major` ?
Yes, I vaguely remember that two calls are enough to be sure that all unreachable blocks are collected. Maybe it would have been better to call [Gc.full_major]?

> what happens with only one?
As far as top and /usr/bin/time say, this is the same as with two calls: constant memory usage.

> or with `Gc.minor`?
Memory usage grows with time, but much more slowly than without the call to [Gc.minor], and stays at a bearable level.
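
As a side note on these experiments: forcing a collection on every item is expensive, and a cheaper variant would be to force one only every N items. Here is a minimal sketch of that idea, using the standard-library Seq and Gc modules rather than CFStream. [fold_with_periodic_gc] and [items] are hypothetical names, and the strings merely stand in for alignments.

```ocaml
(* Sketch: fold over a sequence, forcing a major collection every
   [every] items rather than on each item. [fold_with_periodic_gc] is a
   hypothetical helper built on the standard-library Seq and Gc modules. *)
let fold_with_periodic_gc ~every ~init ~f seq =
  let step (n, accu) item =
    if n mod every = 0 then Gc.major ();
    (n + 1, f accu item)
  in
  snd (Seq.fold_left step (1, init) seq)

(* Stand-in for a stream of alignments: [n] short-lived strings whose
   first character cycles through 'A'..'Z'. *)
let items n =
  Seq.unfold
    (fun i ->
       if i >= n then None
       else Some (String.make 64 (Char.chr (65 + i mod 26)), i + 1))
    0

let () =
  (* Count the items that do not start with 'A'. *)
  let count =
    fold_with_periodic_gc ~every:10_000 ~init:0
      ~f:(fun accu s -> if s.[0] = 'A' then accu else accu + 1)
      (items 100_000)
  in
  Printf.printf "%d\n" count
```

This is only a workaround for measuring, not a fix for whatever retains the memory.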

 
> - Can you try to convert the Bam to a Sam and run the same thing on the Sam file? (Because there is some weird low-level error-prone buffering in `Biocaml_zip`)
Seems like you had the right intuition: the memory usage stays very low without any GC calls!

Do you think it is a serious issue?

Thanks a lot!
ph.




Sebastien Mondet

Nov 27, 2013, 11:10:08 AM
to bio...@googlegroups.com
On Wed, Nov 27, 2013 at 10:12 AM, Philippe Veber <philipp...@gmail.com> wrote:
> Hi Sébastien!
>
> > - Does the memory increase affect speed?
> I'm not sure what you mean... are you asking whether the program makes the system swap?
>
> > or goes out of mem?
> No, but I noticed that if I call another program via Sys.command after having taken all that memory, the other program fails.

I was just trying to check whether it is a case where the GC just tries to do The Right Thing™ or not :)

> > - You said *two* calls to `Gc.major` ?
> Yes, I vaguely remember that two calls are enough to be sure that all unreachable blocks are collected. Maybe it would have been better to call [Gc.full_major]?
>
> > what happens with only one?
> As far as top and /usr/bin/time say, this is the same as with two calls: constant memory usage.
>
> > or with `Gc.minor`?
> Memory usage grows with time, but much more slowly than without the call to [Gc.minor], and stays at a bearable level.
>
> > - Can you try to convert the Bam to a Sam and run the same thing on the Sam file? (Because there is some weird low-level error-prone buffering in `Biocaml_zip`)
> Seems like you had the right intuition: the memory usage stays very low without any GC calls!

Thanks for all the info!

> Do you think it is a serious issue?

Yes


1. One issue is that the default values of the parameters may be pretty wrong:

Right now (now = head of master branch :) ),
when we call `Bam.in_channel_to_item_stream_exn ic` without the optional parameters:

- zlib_buffer_size gets Biocaml_zip.Default.zlib_buffer_size = 4096
- buffer_size gets 65_536 (which is in Biocaml_transform.in_channel_strings_to_stream)

There is also an internal Buffer.t that should grow up to 65_562 (= a bit more than `buffer_size`), which is OK.

But having zlib_buffer_size much smaller than buffer_size may be quite inefficient: it means that every time Unix.read gives something, the "unzip loop" has to run many times on small pieces of data.
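
To put a rough number on "many times", here is a back-of-the-envelope sketch. The two sizes are the defaults quoted above; the assumption that each pass of the unzip loop handles at most zlib_buffer_size bytes is a simplification, and [passes] is a hypothetical helper.

```ocaml
(* Back-of-the-envelope sketch of the buffer-size mismatch. The two sizes
   are the defaults quoted above; the assumption that each pass of the
   unzip loop handles at most [zlib_buffer_size] bytes is a simplification. *)
let zlib_buffer_size = 4096
let buffer_size = 65_536

(* Ceiling division: minimum number of loop passes needed to handle
   [chunk] bytes when each pass handles at most [per_pass] bytes. *)
let passes ~chunk ~per_pass = (chunk + per_pass - 1) / per_pass

let () =
  Printf.printf "passes per full read: at least %d\n"
    (passes ~chunk:buffer_size ~per_pass:zlib_buffer_size)
```

Under that assumption a full read needs at least 16 passes, and probably more in practice since inflated data is larger than the compressed input. Passing the optional parameters explicitly (presumably labelled zlib_buffer_size and buffer_size, though I have not checked the exact signature) would be one way to experiment with other ratios.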


A while ago, I wrote src/benchmark/benchmark_zip.ml to investigate buffer-size influences but I worked on `zip` and not `unzip`...


2. The current implementation uses Buffer.contents and String.sub → this creates a *lot* of intermediary strings

Those may be the ones that the GC does not free often enough (many of them will stay in the minor heap, which would explain why Gc.minor keeps the memory usage "bearable").
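
A quick way to see the cost of the intermediary strings is to measure allocation directly. The sketch below is standard library only and is not the Biocaml code: it compares extracting a small piece through Buffer.contents followed by String.sub against Buffer.sub, with illustrative sizes and a hypothetical [allocated] helper.

```ocaml
(* Sketch: compare bytes allocated when extracting a small piece of a
   Buffer.t via Buffer.contents + String.sub versus Buffer.sub directly.
   Standard library only; the sizes are illustrative assumptions. *)
let allocated f =
  let before = Gc.allocated_bytes () in
  let result = f () in
  let after = Gc.allocated_bytes () in
  (* Keep [result] alive so the call cannot be optimised away. *)
  ignore (Sys.opaque_identity result);
  after -. before

let buf =
  let b = Buffer.create 65_536 in
  Buffer.add_string b (String.make 65_536 'x');
  b

(* Copies the whole buffer, then copies the piece out of that copy. *)
let via_contents () = String.sub (Buffer.contents buf) 0 100

(* Copies only the requested piece. *)
let via_sub () = Buffer.sub buf 0 100

let () =
  Printf.printf "contents + sub: %.0f bytes\nBuffer.sub:     %.0f bytes\n"
    (allocated via_contents) (allocated via_sub)
```

The first variant allocates a full copy of the buffer for every piece it extracts, which is the kind of garbage described above.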

A proper "queue of sub-strings" data structure would do a much better job... I think I'm going to have a lot more fun with offset/length arithmetic :D
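
To make the idea concrete, here is a minimal sketch of what such a queue could look like. All names are hypothetical and this is not what Biocaml does or will do: chunks are kept as (string, offset, length) slices, and bytes are copied only when a caller asks for them.

```ocaml
(* Sketch of a "queue of sub-strings": chunks are stored as
   (string, offset, length) slices and bytes are only copied when a
   caller asks for them. All names here are hypothetical, not Biocaml's. *)
type slice = { str : string; mutable off : int; mutable len : int }

type t = { slices : slice Queue.t; mutable total : int }

let create () = { slices = Queue.create (); total = 0 }

(* Number of bytes currently available. *)
let length q = q.total

(* Enqueue a whole string without copying it. *)
let add q s =
  if s <> "" then begin
    Queue.add { str = s; off = 0; len = String.length s } q.slices;
    q.total <- q.total + String.length s
  end

(* Remove and return the first [n] bytes, or None if fewer are available.
   Each byte is copied exactly once, into the result. *)
let take q n =
  if n < 0 || n > q.total then None
  else begin
    let out = Bytes.create n in
    let pos = ref 0 in
    while !pos < n do
      let s = Queue.peek q.slices in
      let k = min s.len (n - !pos) in
      Bytes.blit_string s.str s.off out !pos k;
      pos := !pos + k;
      if k = s.len then ignore (Queue.pop q.slices)
      else begin
        (* Partially consumed slice: just advance its offset. *)
        s.off <- s.off + k;
        s.len <- s.len - k
      end
    done;
    q.total <- q.total - n;
    (* [out] is never touched again, so no defensive copy is needed. *)
    Some (Bytes.unsafe_to_string out)
  end

let () =
  let q = create () in
  add q "hello, ";
  add q "wor";
  add q "ld";
  (match take q 5 with Some s -> print_endline s | None -> ());
  (match take q 7 with Some s -> print_endline s | None -> ())
```

The offset/length arithmetic lives entirely in [take]; a parser built on top would ask for exactly the number of bytes it needs and never see the chunk boundaries.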

Ashish Agarwal

Nov 27, 2013, 11:28:32 AM
to Biocaml
We should also consider writing a Ctypes-based binding to the new htslib library [1]. At the least, it would allow comparisons with our pure OCaml implementations.

Philippe Veber

Nov 28, 2013, 1:33:33 AM
to Biocaml
That sounds pretty hairy indeed :). Please let me know if there is something I can do to help (writing tests or benchmarks, for instance)!

