future: abstraction over async, lwt, and blocking calls

34 views
Skip to first unread message

Ashish Agarwal

unread,
Jan 15, 2014, 4:35:21 PM1/15/14
to Biocaml
I've pushed a new `future` branch and would appreciate any feedback. The point is to provide Lwt and Async versions of Biocaml without too much work. The approach taken is to define a signature Future.S [1] and 3 implementations of it: Future_blocking [2,3], Future_lwt [4,5], and Future_async [6,7].

The main correspondences between the 3 implementations are:

Blocking        Lwt                              Async
'a                  'a Lwt.t                         'a Deferred.t
in_channel     Lwt_io.input_channel     Reader.t
Stream.t        Lwt_stream.t                 Pipe.Reader.t

As a test case, I implemented a functor Fastq.Make [8,9], which is applied to each of the 3 implementations above to get the following 3 functions (I'm not sure `get` is the best name, but we can tweak that later):

Biocaml.Fastq.get : in_channel -> Fastq.item Stream.t
Biocaml_lwt.Fastq.get : Lwt_io.input_channel -> Fastq.item Lwt_stream.t
Biocaml_async.Fastq.get : Reader.t -> Fastq.item Pipe.Reader.t

The Future.S signature is a (very) small subset of `module type of Async.Std`. Since we've already committed to Core, I thought we should also commit to Async's style. In theory Async.Std immediately satisfies Future.S and there would be no need to define Future_async. However, as a practical matter, I had to deviate from that exact API. We almost have Future_async = Async.Std, but not quite.

I haven't done any performance tests yet. The lwt version is likely very slow due to the issue discussed here [10]. However, the solution to this should be compatible with the Future approach.


Philippe Veber

unread,
Jan 17, 2014, 7:33:22 AM1/17/14
to Biocaml
Hi Ashish, hi all,

Glad you started working on this front! On the signature side, everything looks just fine to me, only have two minor remarks:
- in biocaml_future.ml why not move the definition of Read_result outside of S? This would avoid defining the same type alias in all implementations (ok, there are only three of them, but still)
- in Biocaml_fastq, I suggest read_item instead of get

Now the bigger questions:
- do you plan to add signature items for writing too?
- do you foresee any trouble to use transforms in this setting?

Thanks for sharing this!
ph.


2014/1/15 Ashish Agarwal <agarw...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.
To view this discussion on the web visit https://groups.google.com/d/msgid/biocaml/CAMu2m2K6TNXHqeDqNN-dOs%2BfOpXh5tABYHYNsAkMusX5TfAoTQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ashish Agarwal

unread,
Jan 17, 2014, 10:48:53 AM1/17/14
to Biocaml
On Fri, Jan 17, 2014 at 7:33 AM, Philippe Veber <philipp...@gmail.com> wrote:

- in biocaml_future.ml why not move the definition of Read_result outside of S?

I'd like to keep the signature as equivalent to Async.Std as possible. I don't want to introduce yet another API that people have to learn. My hope is to say "Future.S is Async. Go read Async's documentation".


- in Biocaml_fastq, I suggest read_item instead of get

read_item sounds like you get just one item, but actually this function returns a stream from which you can read all items.

Also keep in mind we need the inverse name to make sense. We'll need the following function:

put : 'a Stream.t -> out_channel -> unit

which will write the stream of values to the given out_channel. Possible choices I've been thinking of are:

* get/put
* read/write
* of_channel/to_channel
* get_stream/put_stream ( but I don't like this because in Future_async, a "stream" is a Pipe.Reader.t, not an Async.Stream.t which also exists but is deprecated )


- do you plan to add signature items for writing too?

Yes. This should be trivial.

 
- do you foresee any trouble to use transforms in this setting?

One reason I like this approach is that we don't need transforms anymore. My feeling is the Transform module is difficult to understand and discourages contributions. The motivation for it was to enable Lwt and Async support, but now we're getting that without going through Transform.

Also note that Pipe is a more sophisticated implementation of Transform. Admittedly it is also difficult to understand (if you use its more complex features), but at least it is part of a more widely used library. That goes a long way to making it more usable. For example, Pipe is covered in Real World OCaml.

Similarly, we have Future.S.Deferred.Result.t, which can replace Flow.t, but better IMO. For one thing, Flow is defined only for Lwt. The functorization now gives us the Flow monad for Lwt, Async, and blocking calls. And again, we have the benefit of uniformity with an existing library, rather than introducing yet another API.

Overall, we can remove Transform and Flow, substantially simplifying Biocaml's implementation, while at the same time increasing its functionality.

Ashish Agarwal

unread,
Jan 17, 2014, 11:09:29 AM1/17/14
to Biocaml
One more thing. Maybe I should make Future a separate library. I can already see myself wanting it for other projects.

Sebastien Mondet

unread,
Jan 17, 2014, 11:26:26 AM1/17/14
to bio...@googlegroups.com
On Fri, Jan 17, 2014 at 11:09 AM, Ashish Agarwal <agarw...@gmail.com> wrote:
One more thing. Maybe I should make Future a separate library. I can already see myself wanting it for other projects.



Before committing to it, I would try to use the approach further, i.e. functorize more than one level and actually use it in an application. Because functorizing everything over a deferred-like signature, we have already tried, and it does not scale very far (preserving type-equalities, especially with Result.t's, slowly becomes a nightmare).




 
--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Ashish Agarwal

unread,
Jan 17, 2014, 12:16:20 PM1/17/14
to Biocaml
On Fri, Jan 17, 2014 at 11:26 AM, Sebastien Mondet <sebastie...@gmail.com> wrote:

functorizing everything over a deferred-like signature, we have already tried, and it does not scale very far (preserving type-equalities, especially with Result.t's, slowly becomes a nightmare).

Okay, I'll keep working in the `future` branch and get to the point of a command line app. Can you expand on why Result.t's are especially difficult?

Sebastien Mondet

unread,
Jan 17, 2014, 12:25:38 PM1/17/14
to bio...@googlegroups.com
Actually it's not really because of Result.t itself, some of the problems I remember were coming from *open* polymorphic variants (that we want on the `Error` side of a Result.t).


 

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Philippe Veber

unread,
Jan 18, 2014, 4:26:34 AM1/18/14
to Biocaml



2014/1/17 Ashish Agarwal <agarw...@gmail.com>

On Fri, Jan 17, 2014 at 7:33 AM, Philippe Veber <philipp...@gmail.com> wrote:

- in biocaml_future.ml why not move the definition of Read_result outside of S?

I'd like to keep the signature as equivalent to Async.Std as possible. I don't want to introduce yet another API that people have to learn. My hope is to say "Future.S is Async. Go read Async's documentation".

My point was just that as long as Read_result only contains a polymorphic variant type, it could be removed from S without altering compatibility with Async. But ok, at some point you'll want to equip Read_result with operations (monad operations, to begin with).



- in Biocaml_fastq, I suggest read_item instead of get

read_item sounds like you get just one item, but actually this function returns a stream from which you can read all items.
my mistake, I did not pay enough attention. In that case, read/write seems a better alternative to me (Fastq.read ic).
 
- do you foresee any trouble to use transforms in this setting?

One reason I like this approach is that we don't need transforms anymore. My feeling is the Transform module is difficult to understand and discourages contributions. The motivation for it was to enable Lwt and Async support, but now we're getting that without going through Transform.
Well, you know I had my share of troubles with transforms ;o), but I'd say that it was probably more an issue of being confronted to a new style than intrinsic difficulty. Transform is somehow rather low-level as you are in contact with the buffering mechanism (cf the `not_ready answer). But maybe this opens up to better performances?

All in all, I'd personnally be more confortable with your proposal, as it is a bit higher level and does not introduce a new type.
 

Also note that Pipe is a more sophisticated implementation of Transform. Admittedly it is also difficult to understand (if you use its more complex features), but at least it is part of a more widely used library. That goes a long way to making it more usable. For example, Pipe is covered in Real World OCaml.

Similarly, we have Future.S.Deferred.Result.t, which can replace Flow.t, but better IMO. For one thing, Flow is defined only for Lwt. The functorization now gives us the Flow monad for Lwt, Async, and blocking calls. And again, we have the benefit of uniformity with an existing library, rather than introducing yet another API.
Certainly.
 

Overall, we can remove Transform and Flow, substantially simplifying Biocaml's implementation, while at the same time increasing its functionality.
I think we have to be careful about performance too before deciding to switch. Have you got some insights on this yet?

 

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Philippe Veber

unread,
Jan 18, 2014, 4:52:46 AM1/18/14
to Biocaml



2014/1/17 Sebastien Mondet <sebastie...@gmail.com>




On Fri, Jan 17, 2014 at 12:16 PM, Ashish Agarwal <agarw...@gmail.com> wrote:
On Fri, Jan 17, 2014 at 11:26 AM, Sebastien Mondet <sebastie...@gmail.com> wrote:

functorizing everything over a deferred-like signature, we have already tried, and it does not scale very far (preserving type-equalities, especially with Result.t's, slowly becomes a nightmare).

Okay, I'll keep working in the `future` branch and get to the point of a command line app. Can you expand on why Result.t's are especially difficult?


 
Actually it's not really because of Result.t itself, some of the problems I remember were coming from *open* polymorphic variants (that we want on the `Error` side of a Result.t).

Yes, the "with type" annotations do not play well with open polymorphic variant types, you're right. I was wondering if first-class modules couldn't be helpful here, to avoid functors completely in client code. Given that Future.S.t has a parameter, this is not trivial, but should be doable user the higher library [1]. I'll give it a thought, just for fun :o)

[1] https://github.com/ocamllabs/higher/raw/paper/higher.pdf

Philippe Veber

unread,
Jan 18, 2014, 2:35:30 PM1/18/14
to Biocaml
Hi again,

using higher kinded polymorphism was indeed fun, but not that useful :o) Here's what it looks like:

open Higher

(* This what Future.S looks like after defunctionalization, which goal is to replace 'a Deferred.t by ('a, Deferred.t) app *)
module type T =
sig
  module Deferred : sig
    type t
    val return : 'a -> ('a, t) app
    val bind : ('a, t) app -> ('a -> ('b, t) app) -> ('b, t) app
  end

  module Reader : sig
    module Read_result : sig
      type 'a t = [ `Eof | `Ok of 'a ]
    end

    type t

    val read_line : t -> (string Read_result.t, Deferred.t) app
  end

end

(* This is how a Future.S is defunctionalized, the main job is to call injections and projections appropriately *)
 
module Higher_future(F : S) : T = struct
  module Deferred = struct
    include Higher.Newtype1(F.Deferred)
    let return x = inj (F.Deferred.return x)
    let bind m f = inj (F.Deferred.bind (prj m) (fun x -> prj (f x)))
  end

  module Reader = struct
    module Read_result = struct
      type 'a t = [ `Eof | `Ok of 'a ]
    end

    type t = F.Reader.t

    let read_line reader = Deferred.inj (F.Reader.read_line reader)
  end
end

(* first parameter is for the deferred monad type, second is for the reader type)
type ('a, 'b) future = (module T with type Deferred.t = 'a and type Reader.t = 'b)

(* here's the meat: define a function that will take a future (as a first-class module) and use its functions to get the job done *)
let get (type m) (type reader) (f : (m, reader) future) reader =
  let module F = (val f) in
  F.Deferred.bind
    (F.Reader.read_line reader)
    (function
    | `Ok x -> F.Deferred.return (Result.Ok x)
    | `Eof -> F.Deferred.return (Result.Error `empty_file))


In the end, the type we obtain for get is:

('a, 'b) future -> 'b -> ((string, [> `empty_file ]) Result.t, 'a) app

To use this get function in practice, you would have to write something like:

get (module Async) reader

However, you'd obtain an app type, so you would have to call a projection function after that.

Conclusion: fun but certainly cumbersome in practice. Having to convert app values back to their monad is doable but heavy. I'm not really sure in what situation using this technique would be a clear win. My initial motivation was to see if that'd help to use open polymorphic variant types for errors. I don't know if it'd scale, but it should work in principle:

 let may (type m) (type reader) (m : (m, reader) future) x f =
   let module F = (val m) in
   F.Deferred.bind x (function
   | Result.Ok x ->
     F.Deferred.bind (f x) (function
     | Result.Ok x -> F.Deferred.return (Result.Ok x)
     | Result.Error y -> F.Deferred.return (Result.Error y)
     )
   | Result.Error y -> F.Deferred.return (Result.Error y)
   )

 let dummy
     (type m) (type reader)
     (m : (m, reader) future)
     (x : ((string, [> `A]) Result.t, m) app)
     (f : string -> ((int, [> `B]) Result.t, m) app) =
   may m x f

The infered types respectively are:

val may :
  ('a, 'b) future ->
  (('c, 'd) Result.t, 'a) app ->
  ('c -> (('e, 'd) Result.t, 'a) app) -> (('e, 'd) Result.t, 'a) app

val dummy :
  ('a, 'b) future ->
  ((string, [> `A | `B ] as 'c) Result.t, 'a) app ->
  (string -> ((int, 'c) Result.t, 'a) app) -> ((int, 'c) Result.t, 'a) app

So here we are: the parser is parameterized by the thread monad, and produces types containing an error type as open polymorphic variant, which can be composed when composing functions. Call me stubborn :o), but I keep thinking we are better off with a nice functor (like Ashish did) and errors as exceptions (and hence, not annotated in the parser type).

Cheers,
ph.







2014/1/18 Philippe Veber <philipp...@gmail.com>

Ashish Agarwal

unread,
Jan 19, 2014, 10:49:47 AM1/19/14
to Biocaml
On Sat, Jan 18, 2014 at 4:26 AM, Philippe Veber <philipp...@gmail.com> wrote:

My point was just that as long as Read_result only contains a polymorphic variant type, it could be removed from S without altering compatibility with Async. But ok, at some point you'll want to equip Read_result with operations (monad operations, to begin with).

Exactly, so I think we're better off blindly mimicking Async's API.


read/write seems a better alternative to me (Fastq.read ic).

Yeah, read/write is sounding good to me. Also it is short unlike our current unwieldy in_channel_to_item_stream.

I think we have to be careful about performance too before deciding to switch. Have you got some insights on this yet?

Not yet. Performance testing is on my todo list, and I agree it it necessary before going too far with any one choice.

Ashish Agarwal

unread,
Jan 19, 2014, 10:55:41 AM1/19/14
to Biocaml
On Sat, Jan 18, 2014 at 4:52 AM, Philippe Veber <philipp...@gmail.com> wrote:

Yes, the "with type" annotations do not play well with open polymorphic variant types, you're right.

Is there a simple example demonstrating the issue? I don't see what it could be given the functorization doesn't affect at all the type variables that the polymorphic variants would instantiate.

Philippe Veber

unread,
Jan 20, 2014, 9:48:11 AM1/20/14
to Biocaml



2014/1/19 Ashish Agarwal <agarw...@gmail.com>

On Sat, Jan 18, 2014 at 4:52 AM, Philippe Veber <philipp...@gmail.com> wrote:

Yes, the "with type" annotations do not play well with open polymorphic variant types, you're right.

Is there a simple example demonstrating the issue? I don't see what it could be given the functorization doesn't affect at all the type variables that the polymorphic variants would instantiate.

Err, now I'm not sure what I meant either :o). I remember having troubles with open polymorphic variants types and functors, but I have nothing specific in mind right now.

In order to be a little more constructive I started a benchmark on the future branch [1]. The bench is on counting the lines of a Fastq file, I do it with Transform and Future, using blocking or Lwt threads. The raw results:

  Thread      Transform      Future  
----------  -------------  ----------
  Blocking        0.63        0.51
       Lwt        0.99        3.53


A couple of remarks:
- I did not pay attention to size of read buffers
- the lwt/transform version furiously leaks memory so don't try the bench on a big file (up to a couple millions of lines is ok)
- more generally the lwt/transform version certainly is poorly written (I tried to get inspiration on some code of biocaml_app_common)
- it lacks an Async line, but I'd like to have a better lwt/transform version before starting it

Ok, so let's say it's a start, please feel free to improve it directly or to give me some indications on how to improve the lwt/transform version.


Cheers,
  Philippe.


[1] https://github.com/pveber/biocaml/tree/future

Sebastien Mondet

unread,
Jan 20, 2014, 11:30:41 PM1/20/14
to bio...@googlegroups.com

In the code base we had at NYU, the main library was abstracted over an IO monad (up to May 2012), the problem was that the whole module dependency tree had to be "followed" by the functorization

Here a simple example would be:

If we have:
Line: IO -> LINE
Fastq: IO -> FASTQ
Fasta: IO -> FASTA

Then the implementation of Fastq.Make must instanciate Line(IO) to use it
same for Fasta.Make

Then a program using both Fastq and Fasta will end up with incompatible Lines.t unless all the "with type" and/or "with module" have been passed around from bottom to top correctly (and that means *every* possible type or submodule defined anywhere)

The other solution is to implement the dependency tree with the functors:

Line: IO -> LINE
Fastq: LINE -> FASTQ
Fasta: LINE -> FASTQ
Biocaml: FASTA -> FASTQ -> ... -> BIOCAML

and that is also a big pain, if suddenly there is something like a Log module that is between IO and LINE we have to redefine everything (we can create mid-level "comon stuff" modules but it is still very painful),
or just imagine the 20 MB error message we'd get when we change an error type that is exposed in BIOCAML.
(yes, before the defunktorization, I was often crashing/freezing emacs with a simple `M-x compile` and one of those signature mismatch messages)


Anil just tweeted about the topic: https://twitter.com/avsm/status/425322323122462720 :D


Yes, the "with type" annotations do not play well with open polymorphic variant types, you're right.

Is there a simple example demonstrating the issue? I don't see what it could be given the functorization doesn't affect at all the type variables that the polymorphic variants would instantiate.

Err, now I'm not sure what I meant either :o). I remember having troubles with open polymorphic variants types and functors, but I have nothing specific in mind right now.

In order to be a little more constructive I started a benchmark on the future branch [1]. The bench is on counting the lines of a Fastq file, I do it with Transform and Future, using blocking or Lwt threads. The raw results:

  Thread      Transform      Future  
----------  -------------  ----------
  Blocking        0.63        0.51
       Lwt        0.99        3.53

 

A couple of remarks:
- I did not pay attention to size of read buffers
- the lwt/transform version furiously leaks memory so don't try the bench on a big file (up to a couple millions of lines is ok)

It's strange to leak memory there,
and it still performs pretty well compared to the functor (0.99 Vs 3.53) ?
 

- more generally the lwt/transform version certainly is poorly written (I tried to get inspiration on some code of biocaml_app_common)
- it lacks an Async line, but I'd like to have a better lwt/transform version before starting it

Ok, so let's say it's a start, please feel free to improve it directly or to give me some indications on how to improve the lwt/transform version.


Cheers,
  Philippe.


[1] https://github.com/pveber/biocaml/tree/future

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Philippe Veber

unread,
Jan 21, 2014, 2:44:11 AM1/21/14
to Biocaml



2014/1/21 Sebastien Mondet <sebastie...@gmail.com>


In the code base we had at NYU, the main library was abstracted over an IO monad (up to May 2012), the problem was that the whole module dependency tree had to be "followed" by the functorization

Here a simple example would be:

If we have:
Line: IO -> LINE
Fastq: IO -> FASTQ
Fasta: IO -> FASTA

Then the implementation of Fastq.Make must instanciate Line(IO) to use it
same for Fasta.Make

Then a program using both Fastq and Fasta will end up with incompatible Lines.t unless all the "with type" and/or "with module" have been passed around from bottom to top correctly (and that means *every* possible type or submodule defined anywhere)
Sorry, I must miss the obvious but although I understand (and agree with) your statement in general, I fail to see where the problem is in our case:

module type IO = sig
  type reader
  val read : reader -> string
end

module Line(IO : IO) : sig
  type t
  val read : IO.reader -> t list
end
=
struct
  type t
  let read _ = assert false
end

module Fastq(IO : IO) : sig
  type item
  val of_lines : Line(IO).t list -> item list
end
=
struct
  type item
  let of_lines _ = assert false
end

module Fasta(IO : IO) : sig
  type item
  val of_lines : Line(IO).t list -> item list
end
=
struct
  type item
  let of_lines _ = assert false
end

module App(IO : IO) = struct
  module Line = Line(IO)
  module Fastq = Fastq(IO)
  module Fasta = Fasta(IO)

  let f reader =
    let lines = Line.read reader in
    Fastq.of_lines lines, Fasta.of_lines lines
end

To be more precise, isn't it enough to write types of the form F(M).t in signatures to avoid type equations most of the time, if not always (in our particular case of course)?
 

The other solution is to implement the dependency tree with the functors:

Line: IO -> LINE
Fastq: LINE -> FASTQ
Fasta: LINE -> FASTQ
Biocaml: FASTA -> FASTQ -> ... -> BIOCAML

and that is also a big pain, if suddenly there is something like a Log module that is between IO and LINE we have to redefine everything (we can create mid-level "comon stuff" modules but it is still very painful),
or just imagine the 20 MB error message we'd get when we change an error type that is exposed in BIOCAML.
(yes, before the defunktorization, I was often crashing/freezing emacs with a simple `M-x compile` and one of those signature mismatch messages)
yes, this solution is terrible.


 


Anil just tweeted about the topic: https://twitter.com/avsm/status/425322323122462720 :D

Hey, you just made three people laugh this morning, thanks!
 


Yes, the "with type" annotations do not play well with open polymorphic variant types, you're right.

Is there a simple example demonstrating the issue? I don't see what it could be given the functorization doesn't affect at all the type variables that the polymorphic variants would instantiate.

Err, now I'm not sure what I meant either :o). I remember having troubles with open polymorphic variants types and functors, but I have nothing specific in mind right now.

In order to be a little more constructive I started a benchmark on the future branch [1]. The bench is on counting the lines of a Fastq file, I do it with Transform and Future, using blocking or Lwt threads. The raw results:

  Thread      Transform      Future  
----------  -------------  ----------
  Blocking        0.63        0.51
       Lwt        0.99        3.53

 

A couple of remarks:
- I did not pay attention to size of read buffers
- the lwt/transform version furiously leaks memory so don't try the bench on a big file (up to a couple millions of lines is ok)

It's strange to leak memory there,
Maybe leak is not the appropriate term, since a call to Gc.full_major could get the memory back. I mean memory consumption grows a lot if nothing's done. Actually the symptoms are pretty similar to what we discussed on BAM files. But I'm not so smart I couldn't make an obvious blunder when writing the lwt/transform code. I'd even say there's a high probability I did.

 
and it still performs pretty well compared to the functor (0.99 Vs 3.53) ?
Yes, at least until it starts to make the computer swap, of course. The bad perf of lwt/future may be related to what Ashish mentions at the end of the first mail of this thread, namely that Lwt's read_line implementation is much less efficient than Pervasives's input_line. @Ashish: did you try to come up with a better implementation of Lwt_io.read_line?



 

- more generally the lwt/transform version certainly is poorly written (I tried to get inspiration on some code of biocaml_app_common)
- it lacks an Async line, but I'd like to have a better lwt/transform version before starting it

Ok, so let's say it's a start, please feel free to improve it directly or to give me some indications on how to improve the lwt/transform version.


Cheers,
  Philippe.


[1] https://github.com/pveber/biocaml/tree/future

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.
To view this discussion on the web visit https://groups.google.com/d/msgid/biocaml/CAOOOohRX6Mn9Mv_h%2Bm00nAFrf2CoSwjAQzYD9_eJcxs1EJ0HWA%40mail.gmail.com.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Ashish Agarwal

unread,
Jan 21, 2014, 8:39:11 AM1/21/14
to Biocaml
Regarding the dependency chain that functorization causes, note that I'm proposing to do this over a small set of Biocaml's implementation. It is only for the IO functions, which is about 2 functions per module (and only modules regarding file parsing/printing).

Regarding the type equalities problem, I don't think of it as too big of a problem. With the right type constraints, you should get what you need. It seems Philippe's suggestion of writing out full types like Line(IO).t would help, but actually I think even this won't be needed. The proposed functor doesn't generate modules with abstract types. Thus, you can just write Lines.t for example, which is outside all functors. Multiple generated types equal to Lines.t will all be compatible.

Finally, I don't see any better solution. We can go through Transform, but I'm quite uncomfortable having a whole new data structure that is: non-trivial to implement as we can see from the Pipe implementation and has been difficult for many people to understand. This is replacing the complexity of functors with another complexity.

We can also maybe do something with first class modules, and I'm happy to try that out. I don't think that will be fundamentally different.


Anil just tweeted about the topic: https://twitter.com/avsm/status/425322323122462720 :D

:)  Okay, so its a pain, but they're still doing it. It's the best solution they can think of too.



Ashish Agarwal

unread,
Jan 21, 2014, 8:44:21 AM1/21/14
to Biocaml
@Ashish: did you try to come up with a better implementation of Lwt_io.read_line?

I don't know if a better read_line will work or if a slightly different function will be needed like:

read_map_lines: input -> f:(string -> 'a) -> 'a Stream.t

This function would read in chunks of data from `input`, split those chunks into lines, apply f to each line, returning the result in a stream. This is essentially the solution that gave better performance in the thread on the Ocsigen list. I just need to polish up the code written there and make it more general. That should make a big change to the performance numbers you computed.




Malcolm Matalka

unread,
Jan 21, 2014, 9:04:58 AM1/21/14
to bio...@googlegroups.com
I might have missed this, but what is the reasoning for providing an
async interface for reading these file formats?

In my, limited, experience, I generally have an async program
orchestrating other blocking programs that actually perform the
operations on the files.

And if an async program needs to read files they can use the blocking
API via an OS thread, which isn't super expensive because usually you're
only reading/writing a few files at a time.

/M

Ashish Agarwal

unread,
Jan 21, 2014, 9:24:29 AM1/21/14
to Biocaml
On Tue, Jan 21, 2014 at 9:04 AM, Malcolm Matalka <mmat...@gmail.com> wrote:

I might have missed this, but what is the reasoning for providing an
async interface for reading these file formats?

Good question! Well, we want to test the performance differences between the various approaches (using an OS thread or not), but we can't unless we first have the options available. Also, it would be nice to have the API for sake of uniformity since it's not hard to provide (despite our lengthy discussion of how to provide it, there's no fundamental challenge. The code is trivial). Finally, having the Lwt version would let us play with js_of_ocaml.

Sebastien Mondet

unread,
Jan 21, 2014, 10:22:08 AM1/21/14
to bio...@googlegroups.com

Philippe:
To be more precise, isn't it enough to write types of the form F(M).t in signatures to avoid type equations most of the time, if not always (in our particular case of course)?

It seems fine indeed, two years ago it was like that (but with much more types) and I ended up switching to the other solution, and then I defunktorized :)


On Tue, Jan 21, 2014 at 9:24 AM, Ashish Agarwal <agarw...@gmail.com> wrote:
On Tue, Jan 21, 2014 at 9:04 AM, Malcolm Matalka <mmat...@gmail.com> wrote:
I might have missed this, but what is the reasoning for providing an
async interface for reading thesee
 
file formats?


But have you done that for big files?

Here what we want is to go through very large files in a streaming fashion, so if the parsing is in a posix-thread there will be a lot of inter-thread communication (very time a chunk is read it has to pass the hand to the main async "thread", it's easy to do with Lwt_condition but I don't know if the performance is good enough:  there is a lot of "OS" context switching).

 
Good question! Well, we want to test the performance differences between the various approaches (using an OS thread or not), but we can't unless we first have the options available. Also, it would be nice to have the API for sake of uniformity since it's not hard to provide (despite our lengthy discussion of how to provide it, there's no fundamental challenge. The code is trivial). Finally, having the Lwt version would let us play with js_of_ocaml.


we still depend on core, with js_of_ocaml it is still not really viable


 

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Malcolm Matalka

unread,
Jan 21, 2014, 10:48:53 AM1/21/14
to bio...@googlegroups.com
> But have you done that for big files?
>
> Here what we want is to go through very large files in a streaming fashion,
> so if the parsing is in a posix-thread there will be a lot of inter-thread
> communication (very time a chunk is read it has to pass the hand to the
> main async "thread", it's easy to do with Lwt_condition but I don't know if
> the performance is good enough: there is a lot of "OS" context switching).

IME, if you are processing a large file, you are better off tossing the
async loop out as is, because it's doing a lot of work you don't want
and it will cost you. I think it's better to toss those operations in
another process and let the OS handle it. Depending on the problem
you're solving, you'll want multi-core support anyways. And you'll
probably want to be able to run across a cluster, so processes are the
optimal (IMO) unit for such things.

I think, for big files, you're better off specializing the processing
program and using an async program to orchestrate things, like a
workflow engine.

Sebastien Mondet

unread,
Jan 21, 2014, 11:05:38 AM1/21/14
to bio...@googlegroups.com
On Tue, Jan 21, 2014 at 10:48 AM, Malcolm Matalka <mmat...@gmail.com> wrote:
> But have you done that for big files?
>
> Here what we want is to go through very large files in a streaming fashion,
> so if the parsing is in a posix-thread there will be a lot of inter-thread
> communication (very time a chunk is read it has to pass the hand to the
> main async "thread", it's easy to do with Lwt_condition but I don't know if
> the performance is good enough:  there is a lot of "OS" context switching).

IME, if you are processing a large file, you are better off tossing the
async loop out as is, because it's doing a lot of work you don't want
and it will cost you.  I think it's better to toss those operations in
another process and let the OS handle it.  Depending on the problem
you're solving, you'll want multi-core support anyways.  And you'll
probably want to be able to run across a cluster, so processes are the
optimal (IMO) unit for such things.


but then you loose the power of Lwt/Async for some processings, e.g. you have sequences and quality scores in two separate FASTA files, it's nice to orchestrate the parsing of two files at once with an async library
in a separate non-async process you'll end up using posix-threads + shared mem and all the horrible things that happen with them :)

 
I think, for big files, you're better off specializing the processing
program and using an async program to orchestrate things, like a
workflow engine.


but even in the separate processes you can use the async library (and in the distributed case, the async library will help a lot for the communication).


also, in the case of Lwt, unlike the Unix/Pervasives modules Lwt_unix/Lwt_io gives fine-grained access to the buffer size used for the unix calls, this can give a nice performance boost (the optimal buffer-size depends on the filesystem... among other things).



 

Ashish Agarwal

unread,
Jan 21, 2014, 11:10:15 AM1/21/14
to Biocaml
> we still depend on core, with js_of_ocaml it is still not really viable

I'm hopeful Core will become js_of_ocaml compatible over time. They continue to make improvements in this direction.

Sebastien Mondet

unread,
Jan 21, 2014, 11:17:58 AM1/21/14
to bio...@googlegroups.com
On Tue, Jan 21, 2014 at 11:10 AM, Ashish Agarwal <agarw...@gmail.com> wrote:
> we still depend on core, with js_of_ocaml it is still not really viable

I'm hopeful Core will become js_of_ocaml compatible over time. They continue to make improvements in this direction.

If you relax the definition of "compatible" to "it compiles and doesn't crash at start-up" then I think Core_kernel is already there (they removed "Num")
but a web-application that loads mega-bytes of unused code, and runs tons of unused module initializations will remain a joke by any standard
(and those initializations leave the library in a very risky inconsistent state, it's difficult to know which functions can be called or not).


 

--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Ashish Agarwal

unread,
Jan 21, 2014, 11:21:17 AM1/21/14
to Biocaml
On Tue, Jan 21, 2014 at 11:17 AM, Sebastien Mondet <sebastie...@gmail.com> wrote:
 
If you relax the definition of "compatible" to "it compiles and doesn't crash at start-up" then I think Core_kernel is already there (they removed "Num")
but a web-application that loads mega-bytes of unused code, and runs tons of unused module initializations will remain a joke by any standard
(and those initializations leave the library in a very risky inconsistent state, it's difficult to know which functions can be called or not).

I see. Okay, well js_of_ocaml was only a minor consideration. It wouldn't be useful for real work anyway since most of Biocaml's functionality is on large files. Also, it might be possible for us to remove the dependency on Core for select modules, if we ever cared to.

Malcolm Matalka

unread,
Jan 21, 2014, 11:50:17 AM1/21/14
to bio...@googlegroups.com
Sebastien Mondet <sebastie...@gmail.com> writes:

> On Tue, Jan 21, 2014 at 10:48 AM, Malcolm Matalka <mmat...@gmail.com>wrote:
>
>> > But have you done that for big files?
>> >
>> > Here what we want is to go through very large files in a streaming
>> fashion,
>> > so if the parsing is in a posix-thread there will be a lot of
>> inter-thread
>> > communication (very time a chunk is read it has to pass the hand to the
>> > main async "thread", it's easy to do with Lwt_condition but I don't know
>> if
>> > the performance is good enough: there is a lot of "OS" context
>> switching).
>>
>> IME, if you are processing a large file, you are better off tossing the
>> async loop out as is, because it's doing a lot of work you don't want
>> and it will cost you. I think it's better to toss those operations in
>> another process and let the OS handle it. Depending on the problem
>> you're solving, you'll want multi-core support anyways. And you'll
>> probably want to be able to run across a cluster, so processes are the
>> optimal (IMO) unit for such things.
>>
>>
> but then you loose the power of Lwt/Async for some processings, e.g. you
> have sequences and quality scores in two separate FASTA files, it's nice to
> orchestrate the parsing of two files at once with an async library
> in a separate non-async process you'll end up using posix-threads + shared
> mem and all the horrible things that happen with them :)
>

I don't have any performance numbers, but assuming spinny disks, reading
two files at the same time probably isn't going to benefit that much
from an async loop unless you tailor your hardware setup very
specifically. And even then, how much time are you going to spend in
the event loop that you'd actually benefit more from being spent in your
code?

>
>
>> I think, for big files, you're better off specializing the processing
>> program and using an async program to orchestrate things, like a
>> workflow engine.
>>
>>
> but even in the separate processes you can use the async library (and in
> the distributed case, the async library will help a lot for the
> communication).
>

IMO, your orchestrator across multiple machines should be separate from
what is actually doing the work, like a workflow engine, so I still
think putting the logic in its own blocking process is better.

But again, I'm talking about processing large files. I think processing
a large file is going to spend a non-trival amount of time in the event
loop, and not in your code, which is where you want it to be. For
smaller files I think the overhead of thread context switches is fine.

Anyways, my 2 cents,
/M

Sebastien Mondet

unread,
Jan 21, 2014, 12:05:03 PM1/21/14
to bio...@googlegroups.com
yes in that case I was talking about the safety benefits, the threading model should not impact much performance (anyway in unices, non-blocking I/O is often still kind-of blocking)

 

Ashish Agarwal

unread,
Jan 21, 2014, 5:33:27 PM1/21/14
to Biocaml
Philippe, I finally had a chance to read the Higher paper and understand your code. This is quite interesting work. Thanks for applying it to this problem. However, I agree that so far functors appear easier in our case.


--
You received this message because you are subscribed to the Google Groups "biocaml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biocaml+u...@googlegroups.com.
To post to this group, send email to bio...@googlegroups.com.
Visit this group at http://groups.google.com/group/biocaml.

Ashish Agarwal

unread,
Jan 23, 2014, 8:35:00 AM1/23/14
to Biocaml
I've created a new `future` library [1]. I'm using it in the `master` branch of biocaml and deleted the `future` branch.


Reply all
Reply to author
Forward
0 new messages