Splitting biocaml (Was: Incubator library (Was: Memory usage with transforms and streams))

Philippe Veber

Nov 29, 2013, 11:16:53 AM

to Biocaml

I like the idea of splitting the big library, for the very reasons you mention. However, I am concerned about dependencies. We must design the sub-libraries carefully so that they do not depend on one another (or at least not too often). Indeed, if we split biocaml, the burden of keeping dependencies up to date basically shifts from ocamlbuild to opam (is that correct?). As slow as compilation might be, omake recompiles one library much faster than opam reinstalls libraries. Also, I'm not sure how to deal with several libraries from the same repo in opam. Another issue with splitting is that you have to find more names, and things can get tricky: should Biocaml_fasta go into biocaml_genomics, given that FASTA files are just as useful for proteomics? Wouldn't that be misleading?

I have a couple of other questions. You summarized the goals of splitting saying:

> But splitting allows more rapid development on sub-libraries,
> keeps compilation times smaller, and lets users install less
> when they really want to.

- Could you expand on how you see splitting allowing more rapid development?
- ocamlfind sub-libraries (biocaml.base, biocaml.genomics, etc.) are enough to limit the number of dependencies a user has to bear. Do you see any downside to using them instead of full-fledged libraries?

Cheers,
Philippe.

2013/11/28 Ashish Agarwal <agarw...@gmail.com>

Sebastien and I were talking about going even further. The current library is already getting quite large, and we were thinking of splitting it into smaller ones. Here's a proposal:

biocaml_base - Very basic types and functions used throughout all other libraries, e.g. Biocaml_internal_pervasives would go here. Virtually all other libraries would depend on this one.

biocaml_genomics - Contains modules related to genomics. At this time, that would be the various file format parsers.

range (or irange) - Modules related to integer intervals. The biocaml prefix can be omitted here because the modules wouldn't have anything to do specifically with biology.

biocaml_app - The command line app could be in a separate repo.

For several of the above we need an ocamlfind sub-package biocaml_foo.lwt and biocaml_foo.async. The implementation of these should go in a single biocaml_foo repo, but they should be selectively installable.

ocaml_htslib - Given this setup, a binding to htslib should simply be a separate library. Surely we would want an asynchronous interface to this (I haven't looked but hopefully the C API allows that), and thus we would need again ocaml_htslib_lwt and ocaml_htslib_async. In this case, I think the biocaml_ prefix can be omitted. Having a top-level module called "Htslib" is intuitive and accurately represents what this library does. Although, I would hope it still follows Biocaml coding, API, and documentation standards.
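One common way to offer the .lwt and .async variants mentioned above without duplicating code is to write the core logic as a functor over a small I/O signature. The following is only a minimal sketch of that pattern: `IO`, `Make_fasta_counter`, `In_memory_io` and `Sync_fasta` are hypothetical names, not part of Biocaml, and the in-memory module merely stands in for what a real Lwt or Async instantiation would provide.

```ocaml
(* Hypothetical sketch; none of these names come from Biocaml itself. *)

(* What the core library asks of any I/O layer (blocking, Lwt, Async...). *)
module type IO = sig
  type 'a t                                 (* a computation yielding 'a *)
  type ic                                   (* an input source *)
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val read_line : ic -> string option t     (* None at end of input *)
end

(* Core logic, written once and independent of the concurrency library. *)
module Make_fasta_counter (Io : IO) = struct
  let is_header line = String.length line > 0 && line.[0] = '>'

  (* Count FASTA records by counting header lines. *)
  let count_records ic =
    let rec loop n =
      Io.bind (Io.read_line ic) (function
        | None -> Io.return n
        | Some line -> loop (if is_header line then n + 1 else n))
    in
    loop 0
end

(* Stand-in instantiation: an identity "monad" reading from a list of lines.
   A foo.lwt or foo.async sub-package would supply its own module instead. *)
module In_memory_io = struct
  type 'a t = 'a
  type ic = string list ref
  let return x = x
  let bind x f = f x
  let read_line ic =
    match !ic with
    | [] -> None
    | line :: rest ->
      ic := rest;
      Some line
end

module Sync_fasta = Make_fasta_counter (In_memory_io)

let () =
  let input = ref [ ">seq1"; "ACGT"; ">seq2"; "GGCC"; "TTAA" ] in
  Printf.printf "%d records\n" (Sync_fasta.count_records input)
```

With this layout, only the small instantiation modules would need to live in the selectively installable sub-packages; the functor itself depends on neither Lwt nor Async.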

The overall idea is that we treat "biocaml" as a namespace, and the overall biocaml suite contains all modules from all of the above. We could even provide a "biocaml" library that depends on all of the above. But splitting allows more rapid development on sub-libraries, keeps compilation times smaller, and lets users install less when they really want to.

Inevitably, we'll want to reorganize libraries in the future. At some point, the biocaml_genomics library will become so big that we might want to split it further. The idea is that we leave ourselves the option of doing that. We view the union of all modules in all libraries as the stable set of constructs being provided, not the specific sub-libraries.
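The "biocaml as a namespace" idea can be sketched with plain module aliases. This is an illustration under assumptions, not the actual layout: the three modules below stand in for separately packaged libraries, and their contents (`Util.strip`, `Fasta.is_header`, the `Range` record) are invented for the example.

```ocaml
(* Stand-ins for separately packaged libraries; contents are hypothetical. *)
module Biocaml_base = struct
  module Util = struct
    let strip = String.trim
  end
end

module Biocaml_genomics = struct
  module Fasta = struct
    (* True if the line, ignoring surrounding whitespace, is a FASTA header. *)
    let is_header line =
      let line = Biocaml_base.Util.strip line in
      String.length line > 0 && line.[0] = '>'
  end
end

module Range = struct
  type t = { lo : int; hi : int }           (* closed integer interval *)
  let make lo hi = if lo <= hi then Some { lo; hi } else None
  let size r = r.hi - r.lo + 1
end

(* The umbrella "biocaml" library: aliases only, no code of its own.
   Users refer to Biocaml.Fasta wherever the module physically lives. *)
module Biocaml = struct
  module Util = Biocaml_base.Util
  module Fasta = Biocaml_genomics.Fasta
  module Range = Range
end

let () =
  assert (Biocaml.Fasta.is_header "  >chr1 ");
  match Biocaml.Range.make 3 7 with
  | Some r -> Printf.printf "size = %d\n" (Biocaml.Range.size r)
  | None -> print_endline "empty range"
```

If a module later moves from one sub-library to another, only the alias in the umbrella changes, which is what lets the union of modules stay stable while the sub-libraries are reorganized.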

Some details have to be considered:
* Several names are inter-related: repo name, opam package name, findlib package name, module names. Dashes are sometimes nicer than underscores, but dashes are problematic in some cases. The Core team decided to go with underscores everywhere for consistency.

* Version numbers. Given my comments above, it would make sense for all libraries to have a single version number. However, the experience of Core already proved that doesn't work. Thus, each library should be version numbered separately.

Now, for your incubator idea, there are two options:
* As you suggest, we can maintain a separate biocaml_incubator library, which is never officially released.

* If the desired module makes sense in one of the above libraries, then you could add it there, and we mark the module as unstable. This is what we do now [1].

Any thoughts? If this sounds okay, I'll work on splitting the library early next week. If you want, I can create a biocaml_incubator repo right away and give you push access.

[1] http://biocaml.org/doc/dev/api/index.html

-Ashish

On Thu, Nov 28, 2013 at 2:17 AM, Philippe Veber <philipp...@gmail.com> wrote:

Hi Ashish,

this is an interesting suggestion, thanks. It made me wonder if we should not have a separate incubator library, for code that is still unstable in its interface but can be worth sharing among us, until it is polished enough to go into the library. It could reside in the same code base next to src/lib, and would be optionally compiled. How does it sound to you?

ph.

2013/11/27 Ashish Agarwal <agarw...@gmail.com>

We should also consider writing a Ctypes based binding to the new htslib library [1]. At the least, it would allow comparisons with our pure OCaml implementations.

[1] https://github.com/samtools/htslib


Ashish Agarwal

Dec 5, 2013, 3:56:28 PM

to Biocaml

Let's hold off on splitting Biocaml for now. Let's instead consider it on a case-by-case basis, so the conversation can be less abstract. For example, if we were going to make a binding to a C library, that would be a clear candidate for a separate library.

> Could you expand on how you see splitting allowing more rapid development?

My statement was too broad. Sometimes it would help, and sometimes not.

> ocamlfind sub-libraries (biocaml.base, biocaml.genomics, etc.) are enough to limit the number of dependencies a user has to bear. Do you see any downside to using them instead of full-fledged libraries?

So, in opam the ocamlfind sub-libraries would get installed optionally depending on what other libraries the user has already installed. That could work, although I find it a bit awkward. I feel the command `opam install foo` should have a clear effect, but now `opam install foo` isn't meaningful by itself. You have to know what else was previously installed to understand what this command does.
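The point about `opam install foo` not being meaningful by itself can be made concrete with a toy model. This is not opam's actual resolution logic: the package `foo`, its sub-libraries, and the function `install_foo` are all hypothetical, and the model only captures the idea that optional sub-libraries get built when their dependency is already present.

```ocaml
(* Toy model only; this does not reproduce how opam really resolves
   optional dependencies. All package names are hypothetical. *)
module S = Set.Make (String)

(* Each sub-library of "foo" with the optional dependency that enables it;
   None means the sub-library is always built. *)
let foo_sublibs =
  [ ("foo", None); ("foo.lwt", Some "lwt"); ("foo.async", Some "async") ]

(* What "install foo" ends up providing, given what was installed before. *)
let install_foo ~already_installed =
  List.filter_map
    (fun (sublib, needs) ->
      match needs with
      | None -> Some sublib
      | Some dep -> if S.mem dep already_installed then Some sublib else None)
    foo_sublibs

let () =
  let show installed =
    let before = S.of_list installed in
    Printf.printf "with {%s} already installed: %s\n"
      (String.concat ", " installed)
      (String.concat ", " (install_foo ~already_installed:before))
  in
  show [];
  show [ "lwt" ];
  show [ "lwt"; "async" ]
```

The same command yields three different sets of sub-libraries depending on prior state, which is the awkwardness described above; separate packages would make each install command have one fixed effect.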

Sebastien Mondet

Dec 5, 2013, 4:09:00 PM

to bio...@googlegroups.com

On Thu, Dec 5, 2013 at 3:56 PM, Ashish Agarwal <agarw...@gmail.com> wrote:

> Let's hold off on splitting Biocaml for now. Let's instead consider it on a case-by-case basis, so the conversation can be less abstract. For example, if we were going to make a binding to a C library, that would be a clear candidate for a separate library.
>
> > Could you expand on how you see splitting allowing more rapid development?
>
> My statement was too broad. Sometimes it would help, and sometimes not.
>
> > ocamlfind sub-libraries (biocaml.base, biocaml.genomics, etc.) are enough to limit the number of dependencies a user has to bear. Do you see any downside to using them instead of full-fledged libraries?
>
> So, in opam the ocamlfind sub-libraries would get installed optionally depending on what other libraries the user has already installed. That could work, although I find it a bit awkward. I feel the command `opam install foo` should have a clear effect, but now `opam install foo` isn't meaningful by itself. You have to know what else was previously installed to understand what this command does.

Moreover, when there is more than one optional dependency, things can get hairy for opam. See https://github.com/ocaml/opam-repository/issues/907, where cohttp's mirage backend would depend on both mirage-net and cstruct; that does not work, so cstruct has been made a normal dependency even though it is not always used (https://github.com/ocaml/opam-repository/blob/master/packages/cohttp/cohttp.0.9.9/opam).

Philippe Veber

Dec 8, 2013, 4:14:36 AM

to Biocaml

2013/12/5 Ashish Agarwal <agarw...@gmail.com>

> Let's hold off on splitting Biocaml for now. Let's instead consider it on a case-by-case basis, so the conversation can be less abstract. For example, if we were going to make a binding to a C library, that would be a clear candidate for a separate library.

Definitely. And just as you say, let's not hesitate to discuss a split when we think it makes sense. I think we'll settle on a case-by-case basis more easily.

> > Could you expand on how you see splitting allowing more rapid development?
>
> My statement was too broad. Sometimes it would help, and sometimes not.
>
> > ocamlfind sub-libraries (biocaml.base, biocaml.genomics, etc.) are enough to limit the number of dependencies a user has to bear. Do you see any downside to using them instead of full-fledged libraries?
>
> So, in opam the ocamlfind sub-libraries would get installed optionally depending on what other libraries the user has already installed. That could work, although I find it a bit awkward. I feel the command `opam install foo` should have a clear effect, but now `opam install foo` isn't meaningful by itself. You have to know what else was previously installed to understand what this command does.

In my mind, all sub-libraries would always be installed, and were only meant to let users load fewer libraries (as you suggest with the *.lwt and *.async sub-libraries). That said, I totally agree that optionally compiled libraries are often difficult to deal with, and this is a strong argument in favour of split libraries, thanks!
