
[Caml-list] Asynchronous IO programming in OCaml


Jon Harrop

Oct 24, 2010, 6:34:54 AM
to caml...@yquem.inria.fr
Is there a tutorial on using something like LWT for asynchronous programming
in OCaml? I'm looking for an example like an echo server that handles
clients concurrently without blocking threads, so it can handle thousands of
clients without significant performance degradation.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

philippe

Oct 24, 2010, 8:52:05 AM
to caml...@yquem.inria.fr
Not an answer to your exact question, but there is something resembling
the reactor pattern in ocaml-nae (ocnae), which is a pattern worth
looking at when that kind of use becomes prevalent.

Dario Teixeira

Oct 24, 2010, 8:52:27 AM
to caml...@yquem.inria.fr, Jon Harrop
Hi,

> Is there a tutorial on using something like LWT for asynchronous
> programming in OCaml? I'm looking for an example like an echo server
> that handles clients concurrently without blocking threads, so it can
> handle thousands of clients without significant performance degradation.

Lwt comes with a manual, which should be enough to get you started.
If you are looking for real-world examples, then the source of the
Ocsigen server is the most obvious place to look.

(Lwt was my first exposure to the monadic style of programming; this
caused some head-scratching in the beginning, but after a while it
became second nature. I reckon this experience might be common to
other Lwt users).

Hope that helps,
Dario Teixeira

Jake Donham

Oct 24, 2010, 12:18:27 PM
to Jon Harrop, caml...@yquem.inria.fr
On Sun, Oct 24, 2010 at 3:34 AM, Jon Harrop <j...@ffconsultancy.com> wrote:

> Is there a tutorial on using something like LWT for asynchronous
> programming in OCaml? I'm looking for an example like an echo server
> that handles clients concurrently without blocking threads, so it can
> handle thousands of clients without significant performance degradation.

Not a tutorial, but here is a minimal TCP server in LWT:


http://github.com/avsm/ocaml-cohttpserver/blob/master/server/http_tcp_server.ml
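
Not that code, but to give a flavour of the shape, a bare Lwt echo server
might look roughly like this (an untested sketch of my own, assuming a
recent Lwt where Lwt_unix.bind returns a thread; port and buffer size are
arbitrary):

open Lwt

(* Echo raw bytes back to the client until EOF; partial writes are
   ignored for brevity. *)
let rec echo fd buf =
  Lwt_unix.read fd buf 0 (Bytes.length buf) >>= fun n ->
  if n = 0 then Lwt_unix.close fd
  else Lwt_unix.write fd buf 0 n >>= fun _ -> echo fd buf

let main () =
  let sock = Lwt_unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
  Lwt_unix.setsockopt sock Unix.SO_REUSEADDR true;
  Lwt_unix.bind sock (Unix.ADDR_INET (Unix.inet_addr_any, 9000)) >>= fun () ->
  Lwt_unix.listen sock 1024;
  let rec accept_loop () =
    Lwt_unix.accept sock >>= fun (fd, _) ->
    (* Each client runs as a cooperative Lwt thread, not an OS thread,
       so thousands of concurrent clients stay cheap. *)
    Lwt.ignore_result
      (Lwt.catch (fun () -> echo fd (Bytes.create 4096))
                 (fun _ -> Lwt_unix.close fd));
    accept_loop ()
  in
  accept_loop ()

let () = Lwt_main.run (main ())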

Jake

oli...@first.in-berlin.de

Oct 24, 2010, 12:33:27 PM
to caml...@yquem.inria.fr
On Sun, Oct 24, 2010 at 05:52:19AM -0700, Dario Teixeira wrote:
[...]

> (Lwt was my first exposure to the monadic style of programming; this
> caused some head-scratching in the beginning, but after a while it
> became second nature. I reckon this experience might be common to
> other Lwt users).
[...]

Can you recommend papers on monadic programming?
Or how did you master it?

Ciao,
Oliver

Dario Teixeira

Oct 24, 2010, 2:50:26 PM
to caml...@yquem.inria.fr, oli...@first.in-berlin.de
Hi,

> Can you recommend papers on monadic programming?
> Or how did you master it?

"Mastered" it might be too strong a word... :-) Anyway, my recommendation
is to simply start using it and let practice do its thing. (In my case
practice came from developing Ocsigen/Eliom apps).

As for books or tutorials, I would suggest taking a look at material for
learning Haskell. Recently, some well-publicised Haskell books targeted
at beginners have come out [1,2]. No introduction to Haskell is really
complete without also discussing monads. (Reading Haskell is fairly
straightforward for those familiar with Ocaml, btw).

Cheers,
Dario Teixeira

[1] http://book.realworldhaskell.org/
[2] http://learnyouahaskell.com/

bluestorm

Oct 24, 2010, 3:05:08 PM
to oli...@first.in-berlin.de, caml...@yquem.inria.fr
A very good introduction to monads for the programmer, in my opinion, is
"Monads for functional programming", by Philip Wadler

http://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf

If one wishes to stay in OCaml country, there is a blog post by Brian Hurt:

http://enfranchisedmind.com/blog/posts/a-monad-tutorial-for-ocaml/
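
To give the flavour in OCaml terms: the whole style revolves around a
bind operator. Here it is for the option monad, as a toy sketch of my
own (Lwt.bind plays the same role for 'a Lwt.t):

let ( >>= ) (m : 'a option) (f : 'a -> 'b option) : 'b option =
  match m with
  | None -> None   (* an earlier step failed: short-circuit *)
  | Some x -> f x  (* success: feed the value to the next step *)

let safe_div x y = if y = 0 then None else Some (x / y)

let () =
  match safe_div 10 2 >>= fun q -> safe_div q 0 with
  | None -> print_endline "a division by zero happened somewhere"
  | Some n -> Printf.printf "%d\n" n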



oli...@first.in-berlin.de

Oct 24, 2010, 4:02:29 PM
to Dario Teixeira, caml...@yquem.inria.fr
On Sun, Oct 24, 2010 at 11:50:16AM -0700, Dario Teixeira wrote:
[...]
> As for books or tutorials, I would suggest taking a look at material for
> learning Haskell.

Heheh, that would have been my next question...

...I may refresh my Haskell and climb the big mountain of Monads there
(in Haskell-land), where I stopped climbing last time and turned to
OCaml. ;)


> Recently, some well-publicised Haskell books targeted
> at beginners have come out [1,2].

Thanks for the hint.


Ciao,
Oliver

Phil Tomson

Oct 24, 2010, 4:13:04 PM
On Oct 24, 3:34 am, "Jon Harrop" <j...@ffconsultancy.com> wrote:
> Is there a tutorial on using something like LWT for asynchronous programming
> in OCaml? I'm looking for an example like an echo server that handles
> clients concurrently without blocking threads, so it can handle thousands of
> clients without significant performance degradation.
>
> --

There's node.ocaml: http://github.com/mathgladiator/node.ocaml

Phil

Goswin von Brederlow

Oct 24, 2010, 4:42:23 PM
to Jon Harrop, caml...@yquem.inria.fr
"Jon Harrop" <j...@ffconsultancy.com> writes:

> Is there a tutorial on using something like LWT for asynchronous programming
> in OCaml? I'm looking for an example like an echo server that handles
> clients concurrently without blocking threads, so it can handle thousands of
> clients without significant performance degradation.

Sorry, not a tutorial, but maybe this gives you some pointers for research.

Which source of blocking do you want to remove?

1) sockets

You can set sockets to non-blocking and then use Unix.select to wait for
activity. But that becomes inefficient if you really want thousands of
clients. This is nothing OCaml-specific but the same old select problem
as in C.

The solution (in C) is to use epoll and I haven't seen an OCaml binding
for that yet. Anyone?

Short of that, mixing threads and select is an option.

2) disk I/O

There are (under Linux) 3 choices for async disk I/O:

a) glibc -> internally threads
b) librt -> internally threads
c) libaio -> needs O_DIRECT to work on files

I've written some bindings for libaio and that works nicely.

No idea about the situation on Windows or Mac.

3) computations

Here you hit a bit of a problem. OCaml doesn't support multiple cores
(there is a multicore implementation, but it isn't standard). So the
best you can do is split computations into little chunks and do time
sharing, or offload jobs to worker threads that are not written in
OCaml.

What I like here is using CPS (continuation passing style). The idea is
that every function gets passed an argument, the continuation, which is
to be called next. If you have used Scanf.scanf then it is a bit like
that. A function only returns when it blocks, in which case it stores a
continuation in some queue to be continued later.
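
To make that concrete, here is a toy sketch (mine, not from any
library): a "thread" that yields enqueues its continuation, and a
trivial scheduler resumes the queued continuations, interleaving the
chunks:

let ready : (unit -> unit) Queue.t = Queue.create ()

(* Instead of returning, a blocked computation stores its continuation
   here to be continued later. *)
let yield k = Queue.add k ready

let rec count name n k =
  if n = 0 then k ()
  else begin
    Printf.printf "%s: %d\n" name n;
    (* Give the other "threads" a turn before continuing. *)
    yield (fun () -> count name (n - 1) k)
  end

let () =
  count "a" 3 (fun () -> ());
  count "b" 3 (fun () -> ());
  (* The scheduler: run continuations until none are left. *)
  while not (Queue.is_empty ready) do
    (Queue.take ready) ()
  done

The output interleaves a and b: that is exactly the time sharing from
point 3.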

A pure echo server can be done fine with just a select loop, and any C
example is easily translated to OCaml. Maybe try that and then look at a
bigger problem and ask more specific questions.
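
For instance, such a translation could look like this (an untested
sketch of my own, using only the stdlib Unix module; no error or
partial-write handling, and the port is arbitrary):

let () =
  let listener = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
  Unix.setsockopt listener Unix.SO_REUSEADDR true;
  Unix.bind listener (Unix.ADDR_INET (Unix.inet_addr_any, 9000));
  Unix.listen listener 64;
  let clients = ref [] in
  let buf = Bytes.create 4096 in
  while true do
    (* Wait until the listener or any client socket becomes readable;
       a negative timeout means wait forever. *)
    let readable, _, _ = Unix.select (listener :: !clients) [] [] (-1.0) in
    List.iter
      (fun fd ->
         if fd = listener then begin
           let (client, _) = Unix.accept listener in
           clients := client :: !clients
         end else begin
           let n = Unix.read fd buf 0 (Bytes.length buf) in
           if n = 0 then begin
             (* Client closed the connection. *)
             Unix.close fd;
             clients := List.filter (fun c -> c <> fd) !clients
           end else
             ignore (Unix.write fd buf 0 n)
         end)
      readable
  done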

MfG
Goswin

Anil Madhavapeddy

Oct 24, 2010, 4:55:03 PM
to Jake Donham, Jon Harrop, caml...@yquem.inria.fr

This should work fine for a couple of thousand clients or so, but you'll begin to see degradation as the number of clients increases. This is because LWT internally uses select(2) to wait for file descriptors, and not the newer kqueue(2) or epoll(2) interfaces. You can read more about the "C10K problem" here: http://www.kegel.com/c10k.html

I've got a stripped-down version of LWT that uses these newer event-driven kernel interfaces (and so should be able to cross the 10,000 client barrier fairly easily), but it won't be ready for release for another month or so. Drop me an email off-list if you want to try it out earlier.

Async disk I/O under Linux is annoyingly problematic if you aren't using direct I/O (and hence page/block-aligned structures). Avoid if possible :)

Anil

Michael Ekstrand

Oct 24, 2010, 5:51:13 PM
to oli...@first.in-berlin.de, caml...@yquem.inria.fr
On 10/24/2010 11:33 AM, oli...@first.in-berlin.de wrote:
> On Sun, Oct 24, 2010 at 05:52:19AM -0700, Dario Teixeira wrote:
> [...]
>> (Lwt was my first exposure to the monadic style of programming; this
>> caused some head-scratching in the beginning, but after a while it
>> became second nature. I reckon this experience might be common to
>> other Lwt users).
> [...]
>
> Can you recommend papers on monadic programming?
> Or how did you master it?

I recommend learning it by just doing Lwt programming. Once you've
learned how to use Lwt, it is much easier IMO to understand the wider
world of monads.

- Michael

Jérémie Dimino

Oct 24, 2010, 6:51:59 PM
to Anil Madhavapeddy, Jon Harrop, caml...@yquem.inria.fr
On Sun, Oct 24, 2010 at 01:54:50PM -0700, Anil Madhavapeddy wrote:
> This should work fine for a couple of thousand clients or so, but you'll
> begin to see degradation as the number of clients increases. This is
> because LWT internally uses select(2) to wait for file descriptors, and
> not the newer kqueue(2) or epoll(2) interfaces. You can read more about
> the "C10K problem" here: http://www.kegel.com/c10k.html

> I've got a stripped-down version of LWT that uses these newer event-driven
> kernel interfaces (and so should be able to cross the 10,000 client
> barrier fairly easily), but it won't be ready for release for another
> month or so. Drop me an email off-list if you want to try it out earlier.

I made an implementation of lwt using libev [1]. I tested it with
ocsigen and ab, but the result was always a bit better with select than
with epoll. That is why I did not replace select by libev in the main
branch. In fact I never found the source of any benchmark comparing
select to epoll on the web.

> Async disk I/O under Linux is annoyingly problematic if you aren't using
> direct I/O (and hence page/block-aligned structures). Avoid if possible :)

The next version of Lwt will support asynchronous disk I/O by using
mincore + mmap. It is already in the development version.

Jérémie

[1] http://www.dimino.org/cgi-bin/darcsweb.cgi?r=lwt-libev;a=summary

Markus Mottl

Oct 24, 2010, 11:42:53 PM
to Jérémie Dimino, Jon Harrop, caml...@yquem.inria.fr, Anil Madhavapeddy
On Sun, Oct 24, 2010 at 18:50, Jérémie Dimino <jer...@dimino.org> wrote:
> I made an implementation of lwt using libev [1]. I tested it with
> ocsigen and ab, but the result was always a bit better with select than
> with epoll. That is why I did not replace select by libev in the main
> branch. In fact I never found the source of any benchmark comparing
> select to epoll on the web.

The performance of select was also usually slightly better in my
experiments than with epoll for at least a few tens of descriptors.
It really depends on what your requirements are. If you are facing
hundreds or even thousands of connections, you'll probably want to
consider epoll. select does not scale well.

Regards,
Markus

--
Markus Mottl        http://www.ocaml.info        markus...@gmail.com

Richard Jones

Oct 25, 2010, 3:50:06 AM
to Markus Mottl, Jon Harrop, caml...@yquem.inria.fr, Jérémie Dimino, Anil Madhavapeddy
On Sun, Oct 24, 2010 at 11:42:45PM -0400, Markus Mottl wrote:
> On Sun, Oct 24, 2010 at 18:50, Jérémie Dimino <jer...@dimino.org> wrote:
> > I made an implementation of lwt using libev [1]. I tested it with
> > ocsigen and ab, but the result was always a bit better with select than
> > with epoll. That is why I did not replace select by libev in the main
> > branch. In fact I never found the source of any benchmark comparing
> > select to epoll on the web.
>
> The performance of select was also usually slightly better in my
> experiments than with epoll for at least a few tens of descriptors.
> It really depends on what your requirements are. If you are facing
> hundreds or even thousands of connections, you'll probably want to
> consider epoll. select does not scale well.

I think if you really had 10,000 clients per server you'd probably want
to reconsider your whole architecture: put nginx or a very cut-down
Apache on the front, and memcached between the webserver and the
database.

Rich.

--
Richard Jones
Red Hat

Goswin von Brederlow

Oct 25, 2010, 4:42:12 AM
to caml...@yquem.inria.fr
Jérémie Dimino <jer...@dimino.org> writes:

> On Sun, Oct 24, 2010 at 01:54:50PM -0700, Anil Madhavapeddy wrote:
>> Async disk I/O under Linux is annoyingly problematic if you aren't using
>> direct I/O (and hence page/block-aligned structures). Avoid if possible :)
>
> The next version of Lwt will support asynchronous disk I/O by using
> mincore + mmap. It is already in the development version.
>
> Jérémie

Doesn't that mean you have to poll all pending I/O? That seems horribly
inefficient (you check them all every time), or slow I/Os can starve
quick ones (you only check the oldest ones).

The nice thing about libaio is that you get events as I/O completes and
you get many events in one simple syscall. Doing thousands of I/O
requests in parallel is trivial there.

MfG
Goswin

Jérémie Dimino

Oct 25, 2010, 7:10:30 AM
to Goswin von Brederlow, caml...@yquem.inria.fr
On Mon, Oct 25, 2010 at 10:42:05AM +0200, Goswin von Brederlow wrote:
> Doesn't that mean you have to poll all pending I/O? That seems horribly
> inefficient (you check them all every time), or slow I/Os can starve
> quick ones (you only check the oldest ones).

No. In the current implementation, when we want to read a file, we mmap
it, then test it with mincore. If it is not cached in memory, we first
wait a bit, test again, and if it is still not available, we launch a
thread which tries to read the first byte of the mmapped buffer.

If the file is not cached, then the time needed to launch a thread is
negligible compared to the time needed to fetch data from the disk.
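
In other words, something with this shape (my paraphrase in OCaml,
untested; `would_block' is a stub standing in for a real mincore(2)
binding, which stock OCaml does not provide, and the real code reads
through the mmapped buffer instead of lseek/read):

(* Stub: a real implementation asks mincore(2) whether the pages backing
   [off, off+len) are resident; here we pessimistically say "not". *)
let would_block (_fd : Unix.file_descr) (_off : int) (_len : int) = true

(* Read [len] bytes at [off] and pass them to [k], using a thread only
   when the data is not already in the page cache. *)
let read_async fd off len k =
  let do_read () =
    ignore (Unix.lseek fd off Unix.SEEK_SET);
    let buf = Bytes.create len in
    let n = Unix.read fd buf 0 len in
    k (Bytes.sub buf 0 n)
  in
  if would_block fd off len then
    (* Not cached: the disk wait dwarfs the cost of launching a thread. *)
    ignore (Thread.create do_read ())
  else
    (* Cached: the read cannot block, so stay on the main thread. *)
    do_read ()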

> The nice thing about libaio is that you get events as I/O completes and
> you get many events in one simple syscall. Doing thousands of I/O
> requests in parallel is trivial there.

AFAIK libaio is Linux-specific and not implemented for all filesystems.
Moreover, it transparently falls back to synchronous I/O when
asynchronous I/O is not available, so there is no reliable way to test
whether an I/O will block or not.

Jérémie

Yaron Minsky

Oct 25, 2010, 11:34:48 AM
to Jérémie Dimino, caml...@inria.fr
Sorry, didn't mean to take this off list. Taking it back to the caml list.

I don't quite understand how this whole benchmark holds together. Could you
post the C code? I don't understand the differences between (1), (2) and
(3) well enough to explain where the factor of 100 comes in.

y

On Mon, Oct 25, 2010 at 10:33 AM, Jérémie Dimino <jer...@dimino.org> wrote:

> On Mon, Oct 25, 2010 at 07:59:17AM -0400, Yaron Minsky wrote:
> > What's the advantage of using mmap here?  Why not just have a few
> > worker threads available for doing file io?  Then you can use the
> > ordinary file API, and you don't need to spin up a brand new thread
> > for each request.
>
> It's because the cost of switching to another thread is too big. I did
> the following benchmarks some time ago (in C): I made three programs
> reading a 100 MB file using a loop in the following way:
>
> 1 - just use read, without threads
> 2 - launch a thread at each iteration
> 3 - use only one thread created before entering the loop
>
> And the results were the following:
>
> - in every case, (2) and (3) gave the same result,
> - when the file was not in the cache, the execution time was the same
> for the three programs,
> - when the file was in the cache, (2) and (3) were about 100 times
> slower than (1)
>
> Jérémie
>

DS

Oct 25, 2010, 11:58:14 AM
to caml...@yquem.inria.fr
On 25 Oct 2010, at 13:10, Jérémie Dimino wrote:
> AFAIK libaio is linux-specific

If we are talking about aio_read() and company, these functions are available on at least FreeBSD, Irix, Solaris and Mac OS X too. I looked into this some years back and while I would say it is standard unix stuff, I do recall finding a few slight differences between platforms. FreeBSD required loading a kernel module (aio.ko IIRC) to use this, and if it wasn't loaded the program would just die horribly.

It can be wrapped as an OCaml module, but expect to put many #ifdef's in your C code.

Regards.
-DS

Jérémie Dimino

Oct 25, 2010, 1:26:42 PM
to Yaron Minsky, caml...@inria.fr
On Mon, Oct 25, 2010 at 11:34:41AM -0400, Yaron Minsky wrote:
> I don't quite understand how this whole benchmark holds together.  Could
> you post the C code?  I don't understand the differences between (1), (2)
> and (3) well enough to explain where the factor of 100 comes in.

Yes. Here is the code of the first program:

,----
| #include <sys/types.h>
| #include <sys/stat.h>
| #include <fcntl.h>
| #include <unistd.h>
|
| int main()
| {
|     int fd = open("data", O_RDONLY);
|     char buffer[4096];
|
|     while (read(fd, buffer, 4096) > 0);
|
|     close(fd);
|
|     return 0;
| }
`----

the code of the second:

,----
| #include <sys/types.h>
| #include <sys/stat.h>
| #include <fcntl.h>
| #include <unistd.h>
| #include <pthread.h>
|
| int fd;
| char buffer[4096];
| int done = 0;
|
| void *callback(void* data)
| {
|     int count = read(fd, buffer, 4096);
|     if (count == 0) done = 1;
|     return NULL;
| }
|
| int main()
| {
|     fd = open("data", O_RDONLY);
|
|     while (!done) {
|         pthread_t thread;
|         pthread_create(&thread, NULL, callback, NULL);
|         pthread_join(thread, NULL);
|     }
|
|     close(fd);
|
|     return 0;
| }
`----

and the third:

,----
| #include <sys/types.h>
| #include <sys/stat.h>
| #include <fcntl.h>
| #include <unistd.h>
| #include <pthread.h>
|
| int fd;
| char buffer[4096];
| int done = 0;
| /* Hand-off flags guard against lost wake-ups; condition variables
|    must be waited on with their mutex held. */
| int start_flag = 0, stop_flag = 0;
| pthread_cond_t start = PTHREAD_COND_INITIALIZER;
| pthread_cond_t stop = PTHREAD_COND_INITIALIZER;
| pthread_mutex_t start_mutex = PTHREAD_MUTEX_INITIALIZER;
| pthread_mutex_t stop_mutex = PTHREAD_MUTEX_INITIALIZER;
|
| void *callback(void* data)
| {
|     while (!done) {
|         /* Wait until main hands us one read to perform. */
|         pthread_mutex_lock(&start_mutex);
|         while (!start_flag)
|             pthread_cond_wait(&start, &start_mutex);
|         start_flag = 0;
|         pthread_mutex_unlock(&start_mutex);
|
|         int count = read(fd, buffer, 4096);
|         if (count == 0) done = 1;
|
|         /* Signal main that the read is finished. */
|         pthread_mutex_lock(&stop_mutex);
|         stop_flag = 1;
|         pthread_cond_signal(&stop);
|         pthread_mutex_unlock(&stop_mutex);
|     }
|     return NULL;
| }
|
| int main()
| {
|     fd = open("data", O_RDONLY);
|
|     pthread_t thread;
|     pthread_create(&thread, NULL, callback, NULL);
|
|     while (!done) {
|         pthread_mutex_lock(&start_mutex);
|         start_flag = 1;
|         pthread_cond_signal(&start);
|         pthread_mutex_unlock(&start_mutex);
|
|         pthread_mutex_lock(&stop_mutex);
|         while (!stop_flag)
|             pthread_cond_wait(&stop, &stop_mutex);
|         stop_flag = 0;
|         pthread_mutex_unlock(&stop_mutex);
|     }
|
|     pthread_join(thread, NULL);
|     close(fd);
|
|     return 0;
| }
`----

Goswin von Brederlow

Oct 27, 2010, 5:37:03 AM
to Jeremie Dimino, Yaron Minsky, caml...@inria.fr
Jérémie Dimino <jer...@dimino.org> writes:

> On Mon, Oct 25, 2010 at 11:34:41AM -0400, Yaron Minsky wrote:
>> I don't quite understand how this whole benchmark holds together.  Could
>> you post the C code?  I don't understand the differences between (1), (2)
>> and (3) well enough to explain where the factor of 100 comes in.
>
> Yes. Here is the code of the first program:
>
> ,----
> | #include <sys/types.h>
> | #include <sys/stat.h>
> | #include <fcntl.h>
> | #include <unistd.h>
> |
> | int main()
> | {
> |     int fd = open("data", O_RDONLY);
> |     char buffer[4096];
> |
> |     while (read(fd, buffer, 4096) > 0);
> |
> |     close(fd);
> |
> |     return 0;
> | }
> `----

Obvious example so nothing to comment. :)

> the code of the second:
>
> ,----
> | #include <sys/types.h>
> | #include <sys/stat.h>
> | #include <fcntl.h>
> | #include <unistd.h>
> | #include <pthread.h>
> |
> | int fd;
> | char buffer[4096];
> | int done = 0;
> |
> | void *callback(void* data)
> | {
> |     int count = read(fd, buffer, 4096);
> |     if (count == 0) done = 1;
> |     return NULL;
> | }
> |
> | int main()
> | {
> |     fd = open("data", O_RDONLY);
> |
> |     while (!done) {
> |         pthread_t thread;
> |         pthread_create(&thread, NULL, callback, NULL);
> |         pthread_join(thread, NULL);
> |     }
> |
> |     close(fd);
> |
> |     return 0;
> | }
> `----

You aren't doing any multithreading. You are creating a thread and
waiting for the thread to finish its read before starting a second.
There are never ever 2 reads running in parallel. So all you do is add
thread creation and destruction for every read to your first example.

You should start multiple threads and let them read from different
offsets (use pread) and only once they are all started join them all
again.

As for the third program: again no parallelism at all. Instead of
thread creation and destruction you now add mutexes between the reads.
Again the only expected result is a slowdown.

You should create X threads at the start and then repeatedly give them
work (an offset to read). Only when they are all busy do you wait for
one of them to become idle again; something like the sketch below.
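
In OCaml with system threads, that shape could look roughly like this (a
toy sketch of my own, untested; one descriptor per worker stands in for
pread, which the stdlib Unix module lacks):

(* Each worker opens its own fd, seeks to its slice and reads it all. *)
let read_slice path off len =
  let fd = Unix.openfile path [Unix.O_RDONLY] 0 in
  ignore (Unix.lseek fd off Unix.SEEK_SET);
  let buf = Bytes.create len in
  let rec loop pos =
    if pos < len then
      let n = Unix.read fd buf pos (len - pos) in
      if n > 0 then loop (pos + n)
  in
  loop 0;
  Unix.close fd

let () =
  (* Four reads genuinely in flight at once, one join at the end. *)
  let slice = 25 * 1024 * 1024 in
  let workers =
    Array.init 4 (fun i ->
      Thread.create (fun () -> read_slice "data" (i * slice) slice) ())
  in
  Array.iter Thread.join workers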

> Jérémie


So far you have failed to do asynchronous IO. Your examples all wait
for the read to complete, making it synchronous, and then threads are
obviously slower.

Also, sequential reads from a file should result in sequential reads
from the disk (unless the filesystem fragments the file a lot or you
created it that way). That is the ideal situation for the disk's and the
kernel's read-ahead. One advantage of doing many reads/writes
asynchronously is that the kernel can reorder the requests and
potentially merge them. Unless you crafted your input file to be
fragmented, you won't see this effect in a test with sequential reads.
And that effect would be the only thing making multithreaded reads
faster in a test like the above.

MfG
Goswin

Jérémie Dimino

Oct 27, 2010, 7:18:49 AM
to Goswin von Brederlow, Yaron Minsky, caml...@inria.fr
On Wed, Oct 27, 2010 at 11:33:51AM +0200, Goswin von Brederlow wrote:
> You aren't doing any multithreading. You are creating a thread and
> waiting for the thread to finish its read before starting a second.
> There are never ever 2 reads running in parallel. So all you do is add
> thread creation and destruction for every read to your first example.

Yes, I know that. The idea was just to show the overhead of context
switches.

> You should start multiple threads and let them read from different
> offsets (use pread) and only once they are all started join them all
> again.

Sure, but doing this directly in Lwt raises other problems:

- This means prefetching a large part of the file into the program
memory.

- The kernel already prefetches files from the disk, so most of the time
this is just a memory copy in parallel...

- How many threads do we launch in parallel? This depends on the
overhead of context switching between threads, which cannot be
determined at runtime.

I think that the solution with mincore + mmap is better because it uses
threads only when really needed.

Jérémie

Goswin von Brederlow

Oct 27, 2010, 9:43:50 AM
to caml...@inria.fr
Jérémie Dimino <jer...@dimino.org> writes:

> On Wed, Oct 27, 2010 at 11:33:51AM +0200, Goswin von Brederlow wrote:
>> You aren't doing any multithreading. You are creating a thread and
>> waiting for the thread to finish its read before starting a second.
>> There are never ever 2 reads running in parallel. So all you do is add
>> thread creation and destruction for every read to your first example.
>
> Yes, I know that. The idea was just to show the overhead of context
> switches.

But then you tune your benchmark to give you the result you want instead
of benchmarking what normally happens.

You already have a context switch when the read returns (or any other
syscall that blocks). The context switch to a thread that waits on it
costs the same as switching to the main thread, at least when it
blocked.

>> You should start multiple threads and let them read from different
>> offsets (use pread) and only once they are all started join them all
>> again.
>
> Sure, but doing this directly in Lwt raises other problems:
>
> - This means prefetching a large part of the file into the program
> memory.

Yes, totally. But if you want to do work asynchronously then you need to
spend the memory to keep the data for multiple jobs in memory.

> - The kernel already prefetches files from the disk, so most of the time
> this is just a memory copy in parallel...

That only works for sequential reads on a very limited number of files
(if more than one at all). If you have 1000+ clients requesting files
from a webserver, for example, they will never be covered by the
kernel's read-ahead.

And don't forget: multi-core systems are more and more widespread.
What is wrong with 2 cores doing memcpy twice as fast?

> - How many threads do we launch in parallel? This depends on the
> overhead of context switching between threads, which cannot be
> determined at runtime.
>
> I think that the solution with mincore + mmap is better because it uses
> threads only when really needed.
>

> Jérémie

For reads I have to agree with you there. You can only hide the cost of
a thread when the read blocks. By using mincore to test first whether a
read would block, you avoid context switches where they aren't free.

Unfortunately you can't use mmap() for writes (efficiently). Writing to
a mmap()ed file will first read in the old data from disk, then
overwrite the memory and sometime later write it back to disk. There
seems to be no way to tell the kernel to skip reading in a page on first
access because one is going to completely overwrite it anyway.


Do you have a different solution for writes that will avoid threads when
a write won't block? And how do you restart jobs once the data has
actually been committed to disk? (i.e. how do you do fsync()?)

MfG
Goswin

Jérémie Dimino

Oct 27, 2010, 11:30:30 AM
to Goswin von Brederlow, caml...@inria.fr
On Wed, Oct 27, 2010 at 03:43:37PM +0200, Goswin von Brederlow wrote:
> But then you tune your benchmark to give you the result you want instead
> of benchmarking what normally happens.

That's true for read or write. The point I wanted to make is that in the
general case, i.e. for a system call that blocks, wrapping it into a
thread might not be a good solution because it degrades performance too
much.

> And don't forget: multi-core systems are more and more widespread.
> What is wrong with 2 cores doing memcpy twice as fast?

I don't think you can read or write RAM in parallel with current
architectures, but I may be wrong.

> Do you have a different solution for writes that will avoid threads when
> a write won't block? And how do you restart jobs once the data has
> actually been committed to disk? (i.e. how do you do fsync()?)

Unfortunately no, we have no solution for writes. The main problem I see
with doing writes in parallel with threads is handling errors.

Jérémie

Goswin von Brederlow

Oct 28, 2010, 5:01:08 AM
to Jeremie Dimino, caml...@inria.fr
Jérémie Dimino <jer...@dimino.org> writes:

> On Wed, Oct 27, 2010 at 03:43:37PM +0200, Goswin von Brederlow wrote:
>> But then you tune your benchmark to give you the result you want instead
>> of benchmarking what normally happens.
>
> That's true for read or write. The point I wanted to make is that in the
> general case, i.e. for a system call that blocks, wrapping it into a
> thread might not be a good solution because it degrades performance too
> much.

Hehe, and you have proven yourself wrong. As you said, when the file
isn't cached and the syscall actually blocks there is no time
difference. The reason being that the blocking takes the majority of
the time anyway, and the context switch for the thread on the read
doesn't cost anything since the kernel needs to context switch anyway. :)

>> And don't forget: multi-core systems are more and more widespread.
>> What is wrong with 2 cores doing memcpy twice as fast?
>
> I don't think you can read or write RAM in parallel with current
> architectures, but I may be wrong.

NUMA has memory dedicated to each core. Each core can access its own
memory fast. Accessing another core's memory, on the other hand, is
slower (and will have to fight for access with that core).

>> Do you have a different solution for writes that will avoid threads when
>> a write won't block? And how do you restart jobs once the data has
>> actually been committed to disk? (i.e. how do you do fsync()?)
>
> Unfortunately no, we have no solution for writes. The main problem I see
> with doing writes in parallel with threads is handling errors.
>
> Jérémie

Nah, if you can detect the error then handling is easy. The problem is
detecting it. write() itself can fail and that is easy to detect. But
write() succeeding doesn't mean the data has been written successfully.
You can still get an error after that, which you only get on fsync().

MfG
Goswin

Jérémie Dimino

Oct 28, 2010, 5:29:01 AM
to Goswin von Brederlow, caml...@inria.fr
On Thu, Oct 28, 2010 at 11:00:59AM +0200, Goswin von Brederlow wrote:
> Hehe, and you have proven yourself wrong. As you said, when the file
> isn't cached and the syscall actually blocks there is no time
> difference. The reason being that the blocking takes the majority of
> the time anyway, and the context switch for the thread on the read
> doesn't cost anything since the kernel needs to context switch anyway. :)

The kernel will always prefetch a part of the file, so the first read
will launch a thread if needed, but subsequent reads will not, until the
offset reaches a part of the file that is not yet cached.

And again, for other system calls, such as fstat, launching a thread is
not a good solution.

> NUMA has memory dedicated to each core. Each core can access its own
> memory fast. Accessing another core's memory, on the other hand, is
> slower (and will have to fight for access with that core).

And so that is a bad solution for Lwt since it runs on one system
thread.

> Nah, if you can detect the error then handling is easy. The problem is
> detecting it. write() itself can fail and that is easy to detect. But
> write() succeeding doesn't mean the data has been written successfully.
> You can still get an error after that, which you only get on fsync().

Take for example buffered channels: when you have to flush the buffer,
you launch a thread calling write and let the program continue to write
to the channel, assuming that the write succeeded. But if one write
fails, you get the error later. And worse, if the program stops using
the buffer after the flush, it will never get the error.

Jérémie

Goswin von Brederlow

Oct 28, 2010, 6:12:02 AM
to Jeremie Dimino, caml...@inria.fr
Jérémie Dimino <jer...@dimino.org> writes:

> On Thu, Oct 28, 2010 at 11:00:59AM +0200, Goswin von Brederlow wrote:
>> NUMA has memory dedicated to each core. Each core can access its own
>> memory fast. Accessing another core's memory, on the other hand, is
>> slower (and will have to fight for access with that core).
>
> And so that is a bad solution for Lwt since it runs on one system
> thread.

Does that matter? The copying would be going on in the kernel and use
multiple cores even if the read requests all came from one core I think.

>> Nah, if you can detect the error then handling is easy. The problem is
>> detecting it. write() itself can fail and that is easy to detect. But
>> write() succeeding doesn't mean the data has been written successfully.
>> You can still get an error after that, which you only get on fsync().
>
> Take for example buffered channels: when you have to flush the buffer,
> you launch a thread calling write and let the program continue to write
> to the channel, assuming that the write succeeded. But if one write
> fails, you get the error later. And worse, if the program stops using
> the buffer after the flush, it will never get the error.
>

> Jérémie

True. Your program will have to handle errors.

Normally I would say that a flush/sync on a channel/file has to block the
thread and continue only on error or success. Otherwise you would be
talking more about barriers and simulating them with flush/sync (since
userspace lacks an interface for them). Doing error handling with
barriers is indeed more tricky.

But I was more thinking of having async writes with error reporting. Say
you have 1000 "threads" that need to write something. They all call
my_write(). The my_write() function issues an async write request,
blocks the "thread" and continues some other "thread". Once the write
request has completed, the "thread" is woken up and gets a success or
failure message. A "thread" here doesn't mean an actual system thread,
more a logical thread, like the handling of a specific connection.

Even more async would be to pass a callback to my_write() that gets
called on success or failure and have my_write() return directly. Again,
error handling is easy since you have the callback to handle it. So
my_write would look something like this:

let my_write fd off str fn = ...
val my_write : Unix.file_descr -> Int64.t -> string ->
               (Unix.error option -> unit) -> unit
(** Write string <str> to file <fd> at offset <off> and call <fn> with
    None on success or Some error on failure. *)

The program can then decide to continue (do something else after
my_write) or to block (put the rest of the code into the callback and
return).
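
To make the callback flow concrete, here is a toy thread-based stand-in
with the same interface (my sketch, untested; a real implementation
would use libaio or a worker pool instead of one thread per write):

(* Issue the write from a separate thread and report the outcome to the
   callback; Unix.LargeFile.lseek accepts the Int64.t offset. *)
let my_write fd off str fn =
  ignore (Thread.create (fun () ->
    try
      ignore (Unix.LargeFile.lseek fd off Unix.SEEK_SET);
      let b = Bytes.of_string str in
      ignore (Unix.write fd b 0 (Bytes.length b));
      fn None
    with Unix.Unix_error (e, _, _) -> fn (Some e)) ())

let () =
  let fd = Unix.openfile "out" [Unix.O_WRONLY; Unix.O_CREAT] 0o644 in
  my_write fd 0L "hello" (function
    | None -> prerr_endline "write ok"
    | Some e -> prerr_endline ("write failed: " ^ Unix.error_message e));
  (* The program continues here; a real one would park this logical
     thread until the callback fires instead of sleeping. *)
  Thread.delay 0.1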

MfG
Goswin
