I have been testing asio's io_service in a threadpool setup for job
dispatching. However, adding threads doesn't seem to improve performance;
if anything the opposite, with 1 thread performing best. Below are the
results of a simple test: I post 10 M jobs to the io_service, then start
N threads on io_service::run. Timings are measured on an 8-core machine;
I would expect execution performance to improve (not worsen) as threads
are added. Posting to the io_service is done single-threaded, so those
timings should remain approximately the same. Please find attached the
test program. Is there something I've missed and/or should tweak to get
the desired improvement per added thread?
Many thanks,
Kind regards,
Rutger ter Borg
Concurrency = 1
Finished posting after: 3.15
Finished execution after: 5.44
Execs / sec: 1e+07/2.29=4.36681e+06
Concurrency = 2
Finished posting after: 2.85
Finished execution after: 5.47
Execs / sec: 1e+07/2.62=3.81679e+06
Concurrency = 3
Finished posting after: 3.15
Finished execution after: 11.65
Execs / sec: 1e+07/8.5=1.17647e+06
Concurrency = 4
Finished posting after: 3.15
Finished execution after: 9.8
Execs / sec: 1e+07/6.65=1.50376e+06
Concurrency = 5
Finished posting after: 3.28
Finished execution after: 12.45
Execs / sec: 1e+07/9.17=1.09051e+06
Concurrency = 6
Finished posting after: 3.29
Finished execution after: 8.84
Execs / sec: 1e+07/5.55=1.8018e+06
Concurrency = 7
Finished posting after: 3.51
Finished execution after: 10.09
Execs / sec: 1e+07/6.58=1.51976e+06
Concurrency = 8
Finished posting after: 3.38
Finished execution after: 12.54
Execs / sec: 1e+07/9.16=1.0917e+06
Hi!
> posting
> 10 M jobs to the io_service, and starting N threads at io_service::run
> after that. Timings below are measured on an 8-core machine, I would expect
I suppose you are the one instantiating the connection.
I had the exact same symptoms a week ago... not sure if this is the same
problem. However, instead of using threadpool, I was using
thread_group(). The code looked something like this:
for (std::size_t i = 0; i < n; ++i) {
    tg.create_thread(boost::bind(&boost::asio::io_service::run, &io_service_));
}
Performance was pretty poor... so to fix the issue, all I had to do was
add usleep(1000) before each create_thread call. Just for reference, it
then took 13 sec to transfer a 1 GB file, instead of 50.
I haven't had time to investigate what was causing the issue.
HTH. :)
vjeko
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
This is apart from any connection; it's measuring io_service as a pure
job dispatcher.
> I had the same exact symptoms a week ago... not sure if this is the same
> problem. However instead of using threadpool, I was using
> thread_group(). The pseudo code looked something like this:
Interesting -- what would be the expected performance difference between a
thread group and a pool of threads?
> I haven't had time to investigate what was causing the issue.
>
Do you know if all of io_service's jobs are dispatched through a
platform-specific dispatcher, e.g., epoll?
Thanks,
Rutger
Someone more familiar with the implementation could comment, but just
poking around the implementation, it appears there is one queue of
handlers shared by all threads; right there I'd expect a lot of lock
contention between threads on the single queue.
I tried translating this example to Intel's TBB library, and I start to
see concurrency effects as I move up beyond 8 threads on my quad-core
box (using parallel_for with a blocked_range that results in a single
call to f per task). Increasing the amount of work done per task (by
increasing the grain size of the blocked_range passed to parallel_for)
speeds up the run-time greatly, presumably because of the reduced number
of tasks and reduced context switching.
I'm guessing that the asio io_service isn't really geared towards
effective use of multi-core CPUs when you're trying to schedule a large
number of small computational tasks; I'll go out on a limb and say that
this *wasn't* the intent of the library (as the name somewhat implies).
Not sure if that was helpful, but it let me play around with TBB, which
seems very nice.
Cheers
Oliver
What if you use io_service-per-cpu approach? How does it affect the performance?
Are you saying it should be taken for granted that ASIO is bad at
handling a great number of (small) tasks? If that's true, then asio must
also be bad at handling a large number of small network messages. I.e.,
I shouldn't try handling all the data of a couple of NICs with ASIO, at
least not using a threadpool setup?
I was under the impression that ASIO is a high-performance asynchronous
event and IO library, and as such, is good at everything it does... Perhaps
a lock-free task-queue would change things for the better.
Thanks for pointing out TBB, I'll take a look -- however I'm primarily
interested in taking message handling/event handling to the max.
Cheers,
Rutger
As I said, perhaps someone more familiar with it could comment; that
assessment was from trying out the sample code, playing around with
it, and playing around with TBB.
There appears to be a 2-lock queue implementation for the handler_queue
used by the io_service for dispatching handlers; I can't really tell
whether it is being used, or whether it has to be enabled manually. It
might help performance in this sample.
> I was under the impression that ASIO is a high-performance asynchronous
> event and IO library, and as such, is good at everything it does... Perhaps
> a lock-free task-queue would change things for the better.
Maybe you should develop a more realistic test. The sample code was
testing the parallelism of the dispatching code in a sort of worst case
(a trivial CPU-bound operation that probably doesn't even need to access
any memory), and it seemingly doesn't scale well to multi-core hardware.
But unless your real handlers are as trivial as those in the sample, so
what?
Having written similar things (i.e., asynchronous message-passing
to/from network sockets), I've never really found much performance
benefit in more than one thread dealing with select/poll/epoll/etc. on a
pool of sockets, compared to the overhead of the computation/IO
associated with actually doing something with what gets read off the
wire.
Anyway, I'm not trying to dissuade you from using ASIO, nor trying to
imply that it isn't high-performance (high-performance compared to what,
for example?). I have no idea what your intended use is. I do notice
that the ASIO documentation doesn't really focus on use beyond device
IO, and it seems perfectly suitable for that. Based solely on that
documentation, and having now looked a bit at the implementation, I
would tend to prefer using something like TBB or a simple thread-pool
implementation for dispatching the computational work, rather than doing
it in ASIO.
> Thanks for pointing out TBB, I'll take a look -- however I'm primarily
> interested in taking message handling/event handling to the max.
Depends on what happens in the event handling; if you're doing
something to disk or a database, I'd suggest you not worry about this
aspect of it. If you're doing some sort of computation, bundle up more
work per message.