Incubator library (Was: Memory usage with transforms and streams)

5 views
Skip to first unread message

Philippe Veber

unread,
Nov 28, 2013, 2:17:57 AM11/28/13
to Biocaml
Hi Ashish,

this is an interesting suggestion, thanks. It made me wonder if we should not have a separate incubator library, for code that is still unstable in its interface but can be worth sharing among us, until it is polished enough to go into the library. It could reside in the same code base next to src/lib, and would be optionally compiled. How does it sound to you?

ph.


2013/11/27 Ashish Agarwal <agarw...@gmail.com>
We should also consider writing a Ctypes based binding to the new htslib library [1]. At the least, it would allow comparisons with our pure OCaml implementations.

[1] https://github.com/samtools/htslib



On Wed, Nov 27, 2013 at 11:10 AM, Sebastien Mondet <sebastie...@gmail.com> wrote:



On Wed, Nov 27, 2013 at 10:12 AM, Philippe Veber <philipp...@gmail.com> wrote:
Hi Sébastien!



- Does the memory increase affect speed?
I'm not sure what you mean ... are you asking if the programs makes the system swap memory?

or goes out of mem?
no, however I noticed that if I call another program via Sys.command after having taken all that memory, the other program fails
 

I was just trying to check whether it is a case where the GC just tries to do The Right Thing™ or not :)

 
- You said *two* calls to `Gc.major` ?
 Yes, I somehow remember that it is enough to be sure that all unreachable blocks are collected. Maybe it would have been better to call [Gc.full_major]?

what happens with only one?
as far as top and /usr/bin/time say, this is the same as two calls: constant memory usage.
 
or with `Gc.minor`?
memory usage grows with time, but much slower than without the call to [Gc.minor], and stays in bearable amounts.

 
- Can you try to convert the Bam to a Sam and run the same thing on the Sam file?  (Because there is some weird low-level error-prone buffering in `Biocaml_zip`)
Seems like you had the right intuition: the memory usage keeps very low without using any GC stuff!


Thanks for all the info!
 
Do you think it is a serious issue?


Yes


1. There is one issue which is that default values of the parameters may be pretty wrong:

Right now (now = head of master branch :) )
when we call  `Bam.in_channel_to_item_stream_exn ic` without the optional parameters:

- zlib_buffer_size gets Biocaml_zip.Default.zlib_buffer_size = 4096
- buffer_size gets 65_536 (which is in Biocaml_transform.in_channel_strings_to_stream)

there is also an internal Buffer.t  that should grow up to 65_562 (= a bit more than `buffer_size`) which is OK

but having zlib_buffer_size much smaller than buffer_size may be quite inefficient (It means that every time Unix.read gives something, the "unzip loop" has to run many times on small pieces of data)


A while ago, I wrote src/benchmark/benchmark_zip.ml to investigate buffer-size influences but I worked on `zip` and not `unzip`...


2. The current implementation uses Buffer.contents and String.sub → this creates a *lot* of intermediary strings

Those are maybe the ones that the GC does not free often enough (many of them will stay in the minor heap, that would be the reason why Gc.minor gets the memory usage "bearable").

A right "queue of sub-strings" data-structure would do a much better job... I think I'm going to have a lot more fun with offset/length arithmetic :D









 
Thanks a lot!
ph.





I'll also look into it.

Thanks!
Seb







On Wed, Nov 27, 2013 at 7:41 AM, Philippe Veber <philipp...@gmail.com> wrote:
Hi everyone,

I wrote a program that scans a BAM file and noticed it consumes a lot of memory, although there should be no reason it does. I simplified it to the following:

open Core.Std
open CFStream

let () =
  let open Biocaml in
  let open Sam.Flags in
  let update accu = function
    | `alignment { Sam.flags = al } ->
      if secondary_alignment al then accu
      else accu + 1
    | _ -> accu
  in
  In_channel.with_file Sys.argv.(1) ~f:(fun ic ->
    Bam.in_channel_to_item_stream_exn ic
    |> Stream.fold ~init:0 ~f:update
  )
  |> print_int

What upsets me most is that the memory footprint grows with the size of the input file. However, I'd expect this program to work in constant memory size. If I add two calls to [Gc.major] to the [update] function, then I obtain the expected memory behavior, so I'm not claiming there's a memory leak.

Have you guys met this problem before? I'm not sure this is related to transform alone or their interaction with streams or something else. I guess I could tweak the GC parameters, but maybe there is a more elegant way to fix the issue. In particular, this style should be supported out-of-the-box IMHO.

ph.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ashish Agarwal

unread,
Nov 28, 2013, 2:37:53 PM11/28/13
to Biocaml
Sebastien and I were talking about going even further. The current library is already getting quite large, and we were thinking of splitting it into smaller ones. Here's a proposal:

biocaml_base - Very basic types and functions used throughout all other libraries, e.g. Biocaml_internal_pervasives would go here. Virtually all other libraries would depend on this one.

biocaml_genomics - Contains modules related to genomics. At this time, that would be the various file format parsers.

range (or irange) - Modules related to integer intervals. The biocaml prefix can be omitted here because the modules wouldn't have anything to do specifically with biology.

biocaml_app - The command line app could be in a separate repo.

For several of the above we need an ocamlfind sub-package biocaml_foo.lwt and biocaml_foo.async. The implementation of these should go in a single biocaml_foo repo, but they should be selectively installable.

ocaml_htslib - Given this setup, a binding to htslib should simply be a separate library. Surely we would want an asynchronous interface to this (I haven't looked but hopefully the C API allows that), and thus we would need again ocaml_htslib_lwt and ocaml_htslib_async. In this case, I think the biocaml_ prefix can be omitted. Having a top-level module called "Htslib" is intuitive and accurately represents what this library does. Although, I would hope it still follows Biocaml coding, API, and documentation standards.

The overall idea is that we treat "biocaml" as a namespace, and the overall biocaml suite contains all modules from all of the above. We could even provide a "biocaml" library that depends on all of the above. But splitting allows more rapid development on sub-libraries, keeps compilation times smaller, and lets users install less when they really want to.

Inevitably, we'll want to reorganize libraries in the future. At some point, the biocaml_genomics library will become so big that we might want to split it further. The idea is that we leave ourselves the option of doing that. We view the union of all modules in all libraries as the stable set of constructs being provided, not the specific sub-libraries.

Some details have to be considered:
* Several names are inter-related: repo name, opam package name, findlib package name, module names. Dashes are sometimes nicer than underscores, but dashes are problematic in some cases. The Core team decided to go with underscores everywhere for consistency.

* Version numbers. Given my comments above, it would make sense for all libraries to have a single version number. However, the experience of Core already proved that doesn't work. Thus, each library should be version numbered separately.

Now, for your incubator idea, there are two options:
* As you suggest, we can maintain a separate biocaml_incubator library, which is never officially released.

* If the desired modules makes sense in one of the above libraries, then you could add it there, and we mark the module as unstable. This is what we do now [1].

Any thoughts? If this sounds okay, I'll work on splitting the library early next week. If you want, I can create a biocaml_incubator repo right away and give you push access.

Malcolm Matalka

unread,
Nov 28, 2013, 3:14:21 PM11/28/13
to bio...@googlegroups.com
My 2 cents:

IMO, working with incubator-like libraries is not an enjoyable
experience. The library limbo zone breeds dead code with no clear
owner. And as an app developer one cannot depend on the incubator API
at all which makes it difficult to rationalize using it for something
serious.

IMHO, all of a library should be officially supported and let semantic
versioning take care of tracking big changes.

/M

Francois Berenger

unread,
Nov 28, 2013, 7:37:12 PM11/28/13
to bio...@googlegroups.com
On 11/29/2013 05:14 AM, Malcolm Matalka wrote:
> My 2 cents:
>
> IMO, working with incubator-like libraries is not an enjoyable
> experience. The library limbo zone breeds dead code with no clear
> owner. And as an app developer one cannot depend on the incubator API
> at all which makes it difficult to rationalize using it for something
> serious.
>
> IMHO, all of a library should be officially supported and let semantic
> versioning take care of tracking big changes.

I completely agree with that.
--
Best regards,
Francois Berenger.

Siraaj Khandkar

unread,
Nov 29, 2013, 10:21:25 AM11/29/13
to bio...@googlegroups.com
I think it is a useful idea, nonetheless. Core_extended is essential
that, no? You may not want to rely on it as a dependency, but it does
contain useful ideas you may want to adopt in your project and/or
eventually promote to "core" status.
--
Siraaj Khandkar
.o.
..o
ooo

Philippe Veber

unread,
Nov 29, 2013, 10:59:05 AM11/29/13
to Biocaml
Hi Ashish, and thanks for this interesting answer! I feel that you're opening another issue so I'll answer in a separate thread (of course the two questions have some intersection, but I feel it will be clearer this way).

My initial concern is on how we should proceed to further develop biocaml, so that we can periodically release stable versions without slowing down the development of new stuff. I perfectly agree with Malcom when he says a library should expose a stable API and fully support all of its components, so that app developers can be confident in using the library. This is exactly why I'd like to draw a line between mature modules and work-in-progress modules. The former will in general only have backward-compatible changes (and if not semantic versioning would be there to remind a user of this exceptional fact), while the latter would change a lot with high probability. But still, I feel it is important that WIP modules are included to biocaml in some way, so that we could discuss them more easily, and that they could make their way progressively to the official distribution. As a result this discussion zone would not be meant to contain dead and rotten code all over the place. Half-done work certainly, but not for a long time (code that we finally do not intend to push to the official library would have to be purged eventually).

Now how could this be implemented?
1. as a sub-library (same code base), which is not compiled by default (you would need to activate a switch for that)
2. as a separate library in the same repo
3. as a separate library in a different repo

All three (maybe there are more?) options are fine to me, as long as we use the same (after all this is important, if it is meant for discussion). Well now, you could also think that such a discussion zone is not so useful, and that we'd better work on a feature "privately" and add it when it is ready. Feel free to share your opinion on this.

Philippe.

PS Related to Ashish and Sébastien's proposal for splitting biocaml, we would then simply have various incubator libraries that would eventually become officially released libraries.





2013/11/28 Ashish Agarwal <agarw...@gmail.com>

Philippe Veber

unread,
Nov 29, 2013, 11:20:48 AM11/29/13
to Biocaml
Hi Malcolm,

as mentionned in my answer to Ashish, I was referring to code that a library user is not supposed to see, that would not be officially released. What I am really after is an efficient way to discuss additions to the library collectively and efficiently. But maybe an incubator library is not an appropriate tool to achieve that?

ph.


2013/11/28 Malcolm Matalka <mmat...@gmail.com>

Philippe Veber

unread,
Nov 29, 2013, 11:21:30 AM11/29/13
to Biocaml
Yes Siraaj, that was exactly my point, thanks.
ph.


2013/11/29 Siraaj Khandkar <siraaj....@gmail.com>

Malcolm Matalka

unread,
Nov 29, 2013, 7:44:17 PM11/29/13
to bio...@googlegroups.com
IMO, The way to discuss additions to the library is pull requests.

A change is either worth incorporating as a first-class citizen or it
isn't. It's either worth the maintainers of a library taking ownership
of it or leaving it for the actual author to maintaining on their own.

I think Siraaj is right, Core_extended is an example of this. I believe
Core_extended is a wasteland of dead code. I avoid it because it's more
work for me to depend on an unsupported API than either implement myself
or decide not to use it. I think code should have a well-defined owner
that is responsible for maintaining the quality, and incubator libraries
are more-or-less antithetical to that.

BUT!

That is my development ideology, and I am not a maintainer or
contributor (yet) to Biocaml, so my opinion is just that, an opinion,
and I encourage the Biocaml team to do what they feel most comfortable
with.

/M

Philippe Veber

unread,
Nov 30, 2013, 3:15:01 AM11/30/13
to Biocaml
Well, first of all thanks for sharing your opinion with us ;)

I suggest we take an example: we discussed some time ago of how to write wrappers for calling external programs. When we started, we had kinda different views on the subject but nevertheless arrived at what seemed a reasonnable compromise. But yet, we certainly needed more time to be sure that it was the right design and in that circumstance we did not start to include the module in the library. It stayed on a branch that soon got far behind. Sometimes, choosing whether a new module should enter the library is not that easy (because it implies you will have to live with it). But this tends to make me shy, and rather that introducing something I'm uncertain about, I tend not to push it. Which is a pity, because this is certainly not the best way to have feedback and improve the code.

Cheers,
  ph.


2013/11/30 Malcolm Matalka <mmat...@gmail.com>

Francois Berenger

unread,
Dec 1, 2013, 7:07:16 PM12/1/13
to bio...@googlegroups.com
On 11/30/2013 09:44 AM, Malcolm Matalka wrote:
> IMO, The way to discuss additions to the library is pull requests.
>
> A change is either worth incorporating as a first-class citizen or it
> isn't. It's either worth the maintainers of a library taking ownership
> of it or leaving it for the actual author to maintaining on their own.

I also agree with that: either you refine a proposal (pull request)
until it is accepted, either you dump it.

Ashish Agarwal

unread,
Dec 5, 2013, 2:56:10 PM12/5/13
to Biocaml
Okay, so no consensus on this. The two proposals are:

a) Make a pull request, leave it there until it fully satisfies Biocaml's standards.
b) Push into master, but in a clearly marked incubator region.

I agree a) should be strived for. So how about we do exactly that, strive for it but not insist on it. Method b) is used even by much larger projects like Batteries and Core. As much as I'd like, it's not practical to think Biocaml can sustain even higher review standards, so I'm hesitant to forbid b) completely.




--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+unsubscribe@googlegroups.com.

Philippe Veber

unread,
Dec 8, 2013, 3:59:18 AM12/8/13
to Biocaml
Ok guys, thanks for chatting on this. I'm not sure either that was a good idea to begin with. I reckon that in practice an incubator can be an opportunity to leave a code before it is really polished. So ok to stick to a), and collectively discuss when we think b) is necessary.
Cheers,
ph.

2013/12/5 Ashish Agarwal <agarw...@gmail.com>
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages