Performance Of Akka IO


Reid Spencer

Feb 21, 2015, 2:17:56 PM
to akka...@googlegroups.com
hAkkers,

One of the reasons I chose to implement my RxMongo driver on Akka was the performance and scalability of spray.io, which was transitioned into Akka IO. Having gotten far enough with my driver to have performance results now, I have to admit I'm a little underwhelmed by them. I've performance-tuned all my code until it is less than 10% of the total time, the rest being what appears to be wait time in Akka IO. At an average of 561 μs per insert over 10,000 requests, I'm not dissatisfied with the performance .. except that the non-reactive Java Mongo driver does the same workload, on the same machine, with the same Mongo instance, in an average of 229 μs .. more than twice as fast!  I am using ACK-based write throttling and doing one complete request/response cycle before writing another message to Akka IO. As far as I can tell, 78% of the total time (441 μs) is lost in the black hole between sending the request to Akka IO and receiving a reply from it. Now, even if the Java Mongo driver were perfect (it is not), there shouldn't be more than 229 μs of delay (on average) for the I/O. So, I'm looking for at least 200 μs lost somewhere in Akka IO. I could go into further detail, but this should suffice as background for my questions.
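
For context, my write path follows the standard Akka IO ACK pattern, roughly like this simplified sketch (not the actual RxMongo code; WriteAck and the actor/parameter names are stand-ins):

import akka.actor.{ Actor, ActorRef }
import akka.io.Tcp
import akka.util.ByteString

// Sketch of the ACK-throttled request/response cycle: one Write per
// request, and the next request is not written until both the write
// ACK and the Mongo reply have arrived.
case object WriteAck extends Tcp.Event

class ChannelSketch(connection: ActorRef, listener: ActorRef) extends Actor {
  def receive: Receive = idle

  def idle: Receive = {
    case request: ByteString =>
      connection ! Tcp.Write(request, WriteAck) // request an ACK for this write
      context.become(awaitingAck)
  }

  def awaitingAck: Receive = {
    case WriteAck => context.become(awaitingReply) // the bytes hit the socket
  }

  def awaitingReply: Receive = {
    case Tcp.Received(reply) =>
      listener ! reply     // forward the response to the client
      context.become(idle) // only now may another request be written
  }
}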

So, I'm hoping I've done something silly with my use of Akka IO and I have some questions to help point me in the right direction to getting the kind of performance I've come to expect with Akka:
  • I have both sampled and instrumented with YourKit. It shows 90% of the time in sun.misc.Unsafe.park .. most of the threads are doing nothing.  It was incredibly unhelpful while investigating this problem.  To get as far as I have, I've developed my own hand-instrumented profiler!  So, is there a performance analysis tool that is best suited for working with Akka?
  • Does Akka, and specifically Akka IO, have any performance analysis instrumentation in it? I've turned on DEBUG level logging but that doesn't help much with understanding performance. I turned on akka.io.tcp.trace-logging but it doesn't seem to do anything.  How can I find out how much time Akka IO is taking for its various operations, short of hand-instrumenting the code myself?
  • The test I'm doing is with one thread and one actor talking to the Akka IO connection. I had thought that the poor performance would get less notable with more threads and actors since the claim with Akka IO is better scalability, not necessarily better performance with a single thread.  But, actually, Akka's relative performance gets worse with higher loads, threads, actors and connections. When multiple threads, actors and connections are used, the gap between the Java Mongo driver and RxMongo widens until Java is 6-7 times faster at 50 connections/threads/actors. I'm hoping this sounds familiar to someone who can say, "you probably forgot to do X".  Anyone know what X is? :)
  • Does anyone have any performance analysis tips or tricks when working with Akka IO ?  Configuration parameters? Rules of thumb? Gotchas?  
In case it matters, my Akka ActorSystem is configured like this:

actor {
  debug {
    # enable DEBUG logging of all AutoReceiveMessages (Kill, PoisonPill etc.)
    autoreceive = off
    # enable DEBUG logging of actor lifecycle changes
    lifecycle = off
    # enable DEBUG logging of all LoggingFSMs for events, transitions and timers
    fsm = off
    # enable DEBUG logging of subscription changes on the eventStream
    event-stream = off
    # enable WARN logging of misconfigured routers
    router-misconfiguration = off
  }

  # What kind of ExecutionService to use
  executor = "fork-join-executor"

  fork-join-executor {
    # minimum number of threads to cap factor-based core number to
    core-pool-size-min = 8
    # No. of core threads: ceil(available processors * factor)
    core-pool-size-factor = 2.0
    # maximum number of threads to cap factor-based number to
    core-pool-size-max = 64
  }

  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 1
}

io {
  tcp {
  }
}

My test machine is an 8-core 3GHz Intel Xeon E5 with 64GB RAM and everything (mongo server, RxMongo) runs locally.

TIA,

Reid.

Reid Spencer

Feb 21, 2015, 4:31:39 PM
to akka...@googlegroups.com
Looking for better instrumentation tools, I decided to try the DiagnosticActorLogging trait, generating my own MDC value containing the System.nanoTime value.
This has only confirmed my suspicions about where the time is going. I instrumented my AkkaIOChannel class, which is the one that talks to the Akka IO TCP OutgoingConnection, and it yielded (thousands of times) this typical interaction:
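
The instrumentation amounts to something like this (a minimal sketch of the technique, not the exact RxMongo code; how the MDC value reaches the log output depends on the logging backend's pattern configuration):

import akka.actor.{ Actor, DiagnosticActorLogging }
import akka.event.Logging

// Sketch: put a microsecond timer value into the MDC so that every
// log line carries the relative timestamp shown in the second column.
class InstrumentedChannel extends Actor with DiagnosticActorLogging {
  override def mdc(currentMessage: Any): Logging.MDC =
    Map("relativeTime" -> System.nanoTime() / 1000)

  def receive: Receive = {
    case msg =>
      log.debug("-> doRequest")
      // ... actual request handling elided ...
      log.debug("<- doRequest")
  }
}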

20:57:28.981UTC 248981245 DEBUG [RxMongo-akka.actor.default-dispatcher-10:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] -> doRequest
20:57:28.981UTC 248981245 DEBUG [RxMongo-akka.actor.default-dispatcher-10:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] <- doRequest
20:57:28.981UTC 248981931 DEBUG [RxMongo-akka.actor.default-dispatcher-12:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] Ack with queuelen=0
20:57:28.982UTC 248982310 DEBUG [RxMongo-akka.actor.default-dispatcher-4:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] -> doReply
20:57:28.982UTC 248982310 DEBUG [RxMongo-akka.actor.default-dispatcher-4:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] <- doReply

To decipher: the second column is the relative time in microseconds and -> means method entry while <- means method exit. 

So: 
  • my request processing takes 0μs (doRequest)
  • my reply processing takes 0μs (doReply)
  • it takes 686μs to get a write ACK from Akka IO ("Ack with queuelen=0")
  • it takes 379μs to get a reply from Mongo and Akka IO. 
Since there is no network here (TCP to 127.0.0.1) and most of the time should be going to the mongod server processing the request, I'm a little confused. Even if the full 379μs were mongod wait time and I/O time, why does it take nearly twice as long, 686μs, to write the request (< 1 kilobyte) to the socket and get an ACK back to my actor? It seems this is where the delay is. If the write were being done more quickly, RxMongo would be ahead of the Java Mongo driver by a long shot.

So, that leads to more questions: 
  • Is there something inherently slow in using the ACK-based write back-pressure?
  • The documentation doesn't say anything about the performance tradeoffs between the various back-pressure models supported by Akka IO. Are there any?
  • Has anyone experienced slowness in writing to an Akka IO TCP OutgoingConnection?
Reid.


Jim Hazen

Feb 21, 2015, 5:24:40 PM
to akka...@googlegroups.com
Hi Reid,

Sorry, I don't have my IDE in front of me.  You might take a look at some of Akka's scheduling configuration options.  You mentioned Spray, and there's a benchmark example Spray app that ships with a pretty good application.conf that exposes and tweaks some of these config values, the scheduler being one of them.  AFAIK the scheduler fires scheduled tasks on a clock/tick-based timer.  You may be bumping up against the scheduler's default clock timing threshold, which can be reduced.
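
From memory, the relevant knobs live under akka.scheduler in application.conf, something like this (the values here are illustrative; the defaults and descriptions are in reference.conf):

akka.scheduler {
  tick-duration = 1ms     # default 10ms; smaller means finer timer granularity
  ticks-per-wheel = 512   # must be a power of two
}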

Hope that helps.

-Jim

Reid Spencer

Feb 21, 2015, 5:43:25 PM
to akka...@googlegroups.com
Hi Jim,

I will certainly track down that example Spray app and take a look; however, my understanding of the Scheduler’s tick-duration configuration field is that it only affects scheduled messages. Since I’m not using the scheduler’s interface at all, I figured that setting was irrelevant. Are you implying that Akka IO uses the scheduler under the covers? In any event, I’ll take a look at this and report back later.

In the mean time, I did some statistics on a 10,000 request sample run of my insert workload (YCSB workloada) which came out like this:

Measurement   Sum       Average            Max     Min   Median
wait          2748551   274.827617238276   16805   147   223
write         1042354   104.22497750225    1712    46    77
read          2142823   214.260873912609   47757   98    144

Times are all in microseconds. The three measurements are:
  • wait - time between request/reply cycles (i.e. the application getting the next request ready)
  • write - the time waiting for the write ACK in Akka IO
  • read - the time waiting for Akka IO to reply with the next read message

My earlier perusal of this data (a limited visual sample) suggested that the write time was much higher than the read time, but this doesn't hold statistically over the whole sample.

104 microseconds for a write isn’t terrible, it’s just not as good as other implementations of a mongo driver :)

Reid.


Reid Spencer

Feb 21, 2015, 5:47:37 PM
to akka...@googlegroups.com
Jim,

The spray-can example benchmark program has an application.conf like this:

akka {
  loglevel = INFO
  event-handlers = ["akka.event.slf4j.Slf4jEventHandler"]
  actor.default-dispatcher {
    fork-join-executor {
      parallelism-min = 1
      parallelism-factor = 1
      #parallelism-max = 1
    }
  }
}
(non-akka parts elided)

The only interesting thing there is parallelism-factor which is 1 here while I use 2. I'll try it with the lower value.

Thanks,

Reid.

jimhaz...@gmail.com

Feb 21, 2015, 6:30:16 PM
to akka...@googlegroups.com
Hmm. Then I must have been playing with the setting around the same time. It's definitely related to overall thread scheduling. Poke around the reference.conf files in the core akka jars. There should be a description of the overall methodology in there as well. Timer granularity and number of buckets are the big factors. 

-Jim

-- Sent from my mobile device

Reid Spencer

Feb 22, 2015, 10:47:24 AM
to akka...@googlegroups.com
Jim,

I think you’re referring to this reference.conf which has this section of the scheduler settings:

  # Used to set the behavior of the scheduler.
  # Changing the default values may change the system behavior drastically so make
  # sure you know what you're doing! See the Scheduler section of the Akka
  # Documentation for more details.
  scheduler {
    # The LightArrayRevolverScheduler is used as the default scheduler in the
    # system. It does not execute the scheduled tasks on exact time, but on every
    # tick, it will run everything that is (over)due. You can increase or decrease
    # the accuracy of the execution timing by specifying smaller or larger tick
    # duration. If you are scheduling a lot of tasks you should consider increasing
    # the ticks per wheel.
    # Note that it might take up to 1 tick to stop the Timer, so setting the
    # tick-duration to a high value will make shutting down the actor system
    # take longer.
    tick-duration = 10ms

    # The timer uses a circular wheel of buckets to store the timer tasks.
    # This should be set such that the majority of scheduled timeouts (for high
    # scheduling frequency) will be shorter than one rotation of the wheel
    # (ticks-per-wheel * ticks-duration)
    # THIS MUST BE A POWER OF TWO!
    ticks-per-wheel = 512

I tried various combinations of these two parameters and none of them made more than a 5% difference in the numbers, which is probably within the noise range for my test setup. While these settings may have consequences in a large actor system with lots of things going on, that's not my situation currently. I have one supervisor actor talking to one channel actor talking to one TCP OutgoingConnection actor. It is pretty much a linear flow of messages into and out of the Akka IO code, repeated in a loop.  I've measured the overhead of the actor system infrastructure (mailbox, message send, etc.) and it is negligible (less than a microsecond).

Do you have any suggestions on how to look at the performance of the Akka IO code? Short of any suggestions from the list, I'm about to start hand-instrumenting the Akka IO code to find out where it spends its time.

Reid. 

Adam

Feb 23, 2015, 2:19:46 AM
to akka...@googlegroups.com
I think the OS you're using matters a lot for this sort of test. Hopefully it's not windows...

What was the maximum concurrency level that you've tested?
Did you make sure (in the performance test code) that this concurrency level is actually met in all cases?
Did you try tweaking with the socket options (e.g. increase the read/write buffers)?

It would be interesting indeed to know about ACK/NACK.
I'm using the approach of using NACK until I get a failure => buffer until getting the resumed message => recover using ACKs => go back to NACK once the buffer is cleared.
I guess it means fewer total messages traveling back and forth.
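
In code, the skeleton of that approach looks roughly like this (a simplified sketch; buffer limits, failures while draining, and the read side are all omitted):

import akka.actor.{ Actor, ActorRef }
import akka.io.Tcp
import akka.util.ByteString

// Sketch: write with NoAck until a write fails (the NACK), buffer and
// suspend until WritingResumed, drain the buffer using ACKs, and then
// switch back to NoAck writes once the buffer is empty.
class NackThenAckWriter(connection: ActorRef) extends Actor {
  case object DrainAck extends Tcp.Event
  private var buffer = Vector.empty[ByteString]

  def receive: Receive = writing

  def writing: Receive = {
    case data: ByteString =>
      connection ! Tcp.Write(data, Tcp.NoAck) // fire-and-forget until failure
    case Tcp.CommandFailed(Tcp.Write(data, _)) =>
      buffer :+= data                // keep the failed write
      connection ! Tcp.ResumeWriting // ask to be told when writing may resume
      context.become(buffering)
  }

  def buffering: Receive = {
    case data: ByteString   => buffer :+= data // queue while suspended
    case Tcp.WritingResumed => drainOne()      // recover, now using ACKs
    case DrainAck =>
      buffer = buffer.tail
      if (buffer.isEmpty) context.become(writing) else drainOne()
  }

  private def drainOne(): Unit = connection ! Tcp.Write(buffer.head, DrainAck)
}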

Reid Spencer

Feb 23, 2015, 3:36:56 AM
to akka...@googlegroups.com
Hi Adam,

Comments inline …

On Feb 23, 2015, at 2:19 AM, Adam <adam...@gmail.com> wrote:

I think the OS you're using matters a lot for this sort of test. Hopefully it's not windows…

No, it is Mac OS X Yosemite 10.10.2. I realize that absolute measurements on this kind of machine are not necessarily accurate, but I'm really only interested in relative measurements. What I am doing is running the YCSB benchmark against both the standard Java Mongo driver and my RxMongo driver. The standard Java driver just uses plain old thread-per-connection blocking socket communications to the mongo daemon. My RxMongo driver uses Akka IO TCP OutgoingConnection. Given that both are run on the exact same machine, I expect that comparing the two over enough iterations on an otherwise idle machine should be statistically sound. Now, if the two were less than 10% apart in their numbers, I wouldn't have written here, as that could be accounted as noise. But when the numbers are off by 100-600%, I'm pretty certain the implementation approach is at fault.


What was the maximum concurrency level that you've tested?

I’ve varied it between 1 thread and 50 threads in the YCSB client. My machine has 8 cores, 16 with hyper threading. However, the tests I’ve been reporting have been with 1 thread. 

Did you make sure (in the performance test code) that this concurrency level is actually met in all cases?

Did you try tweaking with the socket options (e.g. increase the read/write buffers)?

The only ones I set are the ones provided by Akka TCP's SocketOptions. I set:
  • tcpNoDelay : Boolean = true,
  • tcpKeepAlive : Boolean = true,
  • tcpOOBInline : Boolean = false,
 I’m unsure what the Java Mongo driver does but I’d be surprised if it did anything spectacular. If you think this is significant, I could research that. 
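
For reference, these get passed on the Connect command, along the lines of this sketch (the address is illustrative; read/write buffer sizes could be added via Tcp.SO.ReceiveBufferSize / SendBufferSize):

import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.io.{ IO, Tcp }

object ConnectSketch extends App {
  implicit val system = ActorSystem("Sketch")

  // The three socket options I set, passed when opening the connection.
  val options = List(
    Tcp.SO.TcpNoDelay(true),
    Tcp.SO.KeepAlive(true),
    Tcp.SO.OOBInline(false))

  // In real code this is sent from within an actor so that it receives
  // the Connected event; here it just shows how the options are passed.
  IO(Tcp) ! Tcp.Connect(new InetSocketAddress("localhost", 27017), options = options)
}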

It would be interesting indeed to know about ACK/NACK.
I'm using the approach of using NACK until I get a failure => buffer until getting the resumed message => recover using ACKs => go back to NACK once the buffer is cleared.
I guess it means fewer total messages traveling back and forth.

I suppose it would. It is certainly worth looking at. In researching the interaction with mongo, however, I've discovered that it uses a strict request/response protocol (per connection), so if I write a second request before the first one has finished, I lose the response to the first. Consequently, I don't write until I have both the ACK and a response. Stacking up writes, even if they could be done faster, just doesn't help. So, I'm not sure writing until failure would be profitable in this case.

Reid.


Akka Team

Feb 23, 2015, 4:14:33 AM
to Akka User List
Hello Reid,

It seems you're asking about the performance of akka-streams, not of akka-io, as suggested by your mention of outgoingConnection; do not confuse these two modules.


The current performance of akka-streams has not been tuned at all; we know this and have been quite vocal about it during its releases.
The current akka-streams releases are meant mostly to gather feedback about API decisions. Performance has not been tuned yet *at all*, so it is to be expected that it is currently slow.
We have focused completely on getting the infrastructure and models right over the last months; now that this is all falling into place, we can move on to optimisation work.

If you want high performance for IO things right now, fall back to our actor-based IO implementation: http://doc.akka.io/docs/akka/2.3.9/scala/io.html
and do not use streaming IO for performance-sensitive things unless you understand that it is under heavy development and will still be slow for now.
We'll get to optimising akka-streams very soon, and then you can expect nice performance from it.

-- Konrad




--
Akka Team
Typesafe - The software stack for applications that scale
Blog: letitcrash.com
Twitter: @akkateam

Reid Spencer

Feb 23, 2015, 5:50:31 AM
to akka...@googlegroups.com
Hi Konrad,

My apologies for mentioning OutgoingConnection. While I am ALSO working on an akka-streams implementation, I have not done performance analysis with it, as I'm waiting for the M4 release, etc. I meant to simply say that I have an outgoing connection using Akka IO's TCP manager. I understand completely about akka-streams not being optimized yet.  Currently, I am using 2.4-SNAPSHOT. Should I go back to 2.3.9 instead? I have assumed that not much has changed in the akka-actor module between the two versions, but perhaps that's a bad assumption?

Thanks,

Reid.


Reid Spencer

Feb 23, 2015, 3:57:41 PM
to akka...@googlegroups.com
So, here's some more information from my testing, which just further confuses me because I'm left without a course of action to solve the performance issue. I instrumented the TcpConnection actor via DiagnosticActorLogging to record System.nanoTime on each log message using MDC, and then added some logging statements into TcpConnection, mostly to record doRead and doWrite entry and exit. I wanted to see if the problem was in the actual I/O or in TcpConnection. The results are a little odd.


        appWait       doRequest  writeWait   doWrite  writeAck    mongoWait     doRead  readWait    doReply
count   10,000        10,000     10,000      10,000   10,000      10,000        10,000  10,000      10,000
sum     2,641,699.00  0.00       362,172.00  0.00     764,298.00  2,836,716.00  0.00    413,031.00  0.00
mean    264.17        0.00       36.22       0.00     76.43       283.67        0.00    41.30       0.00
median  217.00        0.00       29.00       0.00     55.00       167.00        0.00    33.00       0.00
        323.37        0.00       27.95       0.00     172.19      1,162.76      0.00    25.92       0.00
min     139           0          3           0        2           125           0       15          0
max     17,719        0          892         0        14,616      22,565        0       1,389       0

The above data was computed by processing the log file that has repeating entries like this:

18:25:17.364UTC 917364997 DEBUG [RxMongo-akka.actor.default-dispatcher-10:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] -> doRequest
18:25:17.365UTC 917364997 DEBUG [RxMongo-akka.actor.default-dispatcher-10:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] <- doRequest
18:25:17.365UTC 917365026 DEBUG [RxMongo-akka.actor.default-dispatcher-10:akka.io.TcpOutgoingConnection:akka://RxMongo/system/IO-TCP/selectors/$a/0] Entering doWrite
18:25:17.365UTC 917365026 DEBUG [RxMongo-akka.actor.default-dispatcher-10:akka.io.TcpOutgoingConnection:akka://RxMongo/system/IO-TCP/selectors/$a/0] Leaving doWrite
18:25:17.365UTC 917365077 DEBUG [RxMongo-akka.actor.default-dispatcher-10:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] Ack with queuelen=0
18:25:17.365UTC 917365173 DEBUG [RxMongo-akka.actor.default-dispatcher-3:akka.io.TcpOutgoingConnection:akka://RxMongo/system/IO-TCP/selectors/$a/0] Entering doRead
18:25:17.365UTC 917365173 DEBUG [RxMongo-akka.actor.default-dispatcher-3:akka.io.TcpOutgoingConnection:akka://RxMongo/system/IO-TCP/selectors/$a/0] Leaving doRead
18:25:17.365UTC 917365202 DEBUG [RxMongo-akka.actor.default-dispatcher-3:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] -> doReply
18:25:17.365UTC 917365202 DEBUG [RxMongo-akka.actor.default-dispatcher-3:rxmongo.driver.AkkaIOChannel:akka://RxMongo/user/RxMongo-Supervisor-1/Connection-2/ChannelRouter-3/$a] <- doReply

with a shell script that computes the various times shown in the header row of the table. The second column provides a microsecond timer value based on System.nanoTime. The times computed are:
  • appWait - The time the application takes to prepare the next request (i.e. between the last “<- doReply” and the next “-> doRequest”)
  • doRequest - The time RxMongo takes to prepare the request and invoke connect!Write(msg,Ack)
  • writeWait - The time taken to get the message from RxMongo to TcpConnection (i.e. from “<- doRequest" to  “Entering doWrite”). This is how long Akka takes to process Actor.tell
  • doWrite - The time that TcpConnection.doWrite takes, including sending the Write Ack back
  • writeAck - The time taken from the end of TcpConnection.doWrite to when RxMongo receives the Write Ack (again, measures the Actor.tell time)
  • mongoWait - The time taken from when TcpConnection writes to the socket until the reply is received (measures the response time of the mongod server and operating system)
  • doRead - The time that TcpConnection.doRead takes, including sending the reply back to RxMongo
  • readWait - The time it takes from when TcpConnection.doRead finishes until the reply message arrives at the RxMongo actor (again, measuring Actor.tell)
  • doReply - The time that RxMongo uses to process the reply and forward it on to the client. 

What is really amazing is that the times for doRequest, doWrite, doRead, and doReply are all 0μs. That is, they always complete in less time than the resolution of System.nanoTime on my JVM. These measurements cover where all the work and I/O calls are done, and they amount to nothing in the overall latency. As one would expect, most of the time, on average, is spent waiting for mongo to respond (283.67μs), but Akka messaging also plays a significant role. Waiting for the messages to transition between actors accounts for another 36.22+76.43+41.30 = 153.95μs.  During this run, YCSB computed an average latency of 633.72μs, which is reasonably close to the full latency I see here (283.67+153.95+264.17 = 701.79μs).

I think the big surprise to me is how much Akka messaging impacts this test. If the time taken by Actor.tell is compared against the mongo server's response time, it amounts to a 153.95/283.67 = 54% overhead!  The Akka home page claims that Akka can do "50 million msg/sec on a single machine." I've had no reason to doubt that until now. At that rate, the overhead of Actor.tell would be around 20 nanoseconds, which is reasonable on fast modern hardware. It should have been unmeasurable in my tests, but the average wait for a message is right around 50 microseconds ((36.22+76.43+41.30)/3 = 153.95/3 = 51.32), more than three orders of magnitude larger!

Are these overheads in Actor.tell that I’m experiencing typical in people’s experience with Akka on modern hardware with JVM 7?  I had always figured Akka’s overhead was much lower than this, but perhaps what I’m discovering here is normal? 

I’m starting to conclude that using Akka for IO is not my best strategy and I should write my own socket interactions. I really, really don’t want this to be true! And, I'm hoping someone can dissuade me from these conclusions and point out where I’ve gone wrong. :)

Because configuration seems to matter but I haven't been able to find a setting that makes any significant difference in these numbers, I had Akka print out its configuration at startup and have included it as an attachment here. I also have the raw data I used to generate the table above, but since it is 3MB, I decided not to spam your Inbox. If you want that file, please ask me for it privately and I'll send it to you (let me know which format you would like .. it is a spreadsheet).

akka.conf.save

Adam

Feb 23, 2015, 4:58:07 PM
to akka...@googlegroups.com
I believe 50M messages per second on a single machine was mainly an example for scaling up. See the kind of machine that was used for it.

Anyway, have you tried doing the same analysis on the "legacy" driver?
It would be interesting to compare the parts related to actual IO.

There's one thing (which may not be helpful to your current questions about Akka performance) that bothers me, but I may just be missing something obvious here, so I'll just ask: if Mongo doesn't allow a connection to handle more than one simultaneous request, how can any NBIO implementation do much better than a BIO implementation? I mean, the test pretty much only does IO. It's not like we're using the extra free thread to do any other CPU-intensive work while waiting on Mongo, right?
Doesn't it mean you have to hold some sort of connection pool so you can have more concurrent requests to Mongo in order to see any sort of benefit from using NBIO (Akka or not)?
When I use Akka IO in my application I often have lots of concurrent requests using the same socket.
They are of course kept serialized by Akka, but I can still have many requests on the same connection.
...

Reid Spencer

Feb 23, 2015, 5:15:05 PM
to akka...@googlegroups.com
Hi Adam,


On Feb 23, 2015, at 4:58 PM, Adam <adam...@gmail.com> wrote:

I believe 50M messages per second on a single machine was mainly an example for scaling up. See the kind of machine that was used for it.

Right. While my machine's clock is faster (3.0GHz vs their 2.1GHz) than their reference hardware, it does not have as many CPU cores, and the 50 million/second was certainly using all cores in parallel.  I'm testing a very non-concurrent setup here, which is why I asked if 50 microseconds was a typical latency for Actor.tell. With enough concurrency, I'm sure I could drive the average number down into the nanoseconds on my hardware too.  Still, three orders of magnitude is pretty significant, considering the difference in CPU count is a factor of 8, not a factor of 1000.

Anyway, have you tried doing the same analysis on the "legacy" driver?

No. I'm just comparing the average latency of the legacy driver to my RxMongo driver. The legacy Java driver has always been 2-7 times faster, and I'm trying to discover why by looking at my code, not theirs :)

It would be interesting to compare the parts related to actual IO.

I have thought about just straight benchmarking the mongo raw I/O performance using the same requests but from a client written in C that just codes the wire protocol directly. Unfortunately, that is a LOT of work just to come up with a baseline. 

There's one thing (which may not be helpful to your current questions about Akka performance) that bothers me, but I may just be missing out on something obvious here, so I'll just ask: if Mongo doesn't allow a connection to handle more than one simultaneous request, how can any NBIO implementation do much better than a BIO implementation?

Typically it can't. Mongo does allow writes that do not ask for a response to be queued without loss (it will flow-control the client if its buffers get full). However, that's not the typical case and not what the legacy driver is doing.

I mean the test pretty much only does IO. It's not like we're using the extra free thread to do any other CPU intensive work while waiting on Mongo, right?
Correct.

Doesn't it mean you have to hold some sort of connection pool so you can have more concurrent requests to Mongo in order to see any sort of benefit from using NBIO (Akka or not)?

Yes, and RxMongo does. You can even configure the connection pool size directly in the URL used to connect to Mongo (e.g. the minPoolSize and maxPoolSize query parameters). I can certainly use more connections and get more throughput than in the test I am doing right now. However, the gap between RxMongo and the legacy driver WIDENS in that case. So, whatever this issue fundamentally is, it is least significant with just one connection and one client thread. Adding more threads/connections just increases the overhead Akka is contributing.

When I use Akka IO in my application I often have lots of concurrent requests using the same socket.
They are of course kept serialized by Akka, but I can still have many requests on the same connection.

Sure. I’m not sure what Akka IO is doing under the covers with socket management and haven’t looked into this as my assumption is that it is correct.

Reid. 
 


Roland Kuhn

Feb 24, 2015, 5:25:07 AM
to akka-user
Hi Reid,

currently I don't have the bandwidth to fully understand what you are doing, but the numbers you quote here sound suspiciously like the “ping-pong problem”: if your thread pool is too big and messages always hop threads even when they should not, then you incur the full CPU wake-up latency (which is on the order of 100µs) for each and every message send. The ping-pong benchmark (with one ping and one pong each) gets roughly 1000x faster by using only one thread for the two actors, and the reason is that these two actors cannot be active concurrently anyway.
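
A quick way to test this hypothesis is to force everything onto one thread, along these lines (illustrative settings):

akka.actor.default-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 1
    parallelism-max = 1
  }
  # a higher throughput also keeps an actor on its thread longer
  throughput = 100
}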

Another thing I notice: 10,000 repetitions is by far not enough to benchmark anything on the JVM. Proper inlining and optimizations require a lot more than that. The other question is how much overhead your way of tracing introduces, because generating a log message per Actor message send will obviously double the message-sending overhead.

Regards,

Roland




Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn


Reid Spencer

Feb 24, 2015, 11:55:30 AM
to akka...@googlegroups.com
Hi Roland,


> On Feb 24, 2015, at 5:23 AM, Roland Kuhn <goo...@rkuhn.info> wrote:
>
> Hi Reid,
>
> currently I don’t have the bandwidth to fully understand what you are doing,

I appreciate any response at all. Thank you!

> but the numbers you quote here sound suspiciously like the “ping–pong problem”: if your thread pool is too big and messages always hop threads even if they should not, then you will incur the full CPU wake-up latency (which is of the order of 100µs) for each and every message send.

Okay, that seems plausible and it gives me a new area of inquiry to investigate.

> The ping–pong benchmark (with one ping and one pong each) gets roughly 1000x faster by using only one thread for the two actors—and the reason is that these two actors cannot be active concurrently anyway.

I will try configuring my executor with a single thread and see if that makes it fly.

>
> Another thing I notice: 10,000 repetitions is by far not enough to benchmark anything on the JVM. Proper inlining and optimizations require a lot more than that.

That’s good to know. I can certainly try this with a million or more.

> The other question is how much overhead your way of tracing introduces, because generating a log message per Actor message send will obviously double the message-sending overhead.

Also a good point. I’ll repeat my experiments without that overhead.

>
> Regards,
>
> Roland

Reid Spencer

Feb 24, 2015, 4:56:39 PM
to akka...@googlegroups.com
Roland,

I took all your suggestions to heart and changed my test to include:
  • Configured my FJE to use a single thread and throughput=100
  • Ran my test for 1 million iterations instead of 10,000
  • Removed all logging and profiling code
  • Used Akka 2.3.9 instead of 2.4-SNAPSHOT

This seems to have made very little difference. Since all my instrumentation is removed, all I have now is the data reported by the YCSB benchmark, as follows.

Legacy Mongo Java Driver Results:
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -threads 1 -load
Loading workload...
Starting test.
new database url = localhost:27017/ycsb
2015-02-24 15:06:12:797 0 sec: 0 operations;
mongo connection created with localhost:27017/ycsb
[OVERALL], RunTime(ms), 193841.0
[OVERALL], Throughput(ops/sec), 5158.8673190914205
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 191.429134
[INSERT], MinLatency(us), 78
[INSERT], MaxLatency(us), 2231484
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 0
[INSERT], Return=0, 1000000

My RxMongo Results:
YCSB Client 0.1
Command line: -db com.reactific.ycsb.RxMongoClient -s -P workloads/workloada -load -threads 1
Loading workload...
Starting test.
2015-02-24 14:32:16:302 0 sec: 0 operations;
rxmongo connection created with mongodb://localhost:27017/ycsb?minPoolSize=2&maxPoolSize=2
[OVERALL], RunTime(ms), 305149.0
[OVERALL], Throughput(ops/sec), 3277.08758671993
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 300.902736
[INSERT], MinLatency(us), 186
[INSERT], MaxLatency(us), 2756796
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 0
[INSERT], Return=0, 1000000
It is still, on average, about 57% slower than the legacy driver. While I think there is room for improvement here, I don't have many more cycles to spend on performance analysis unless hAkkers have more suggestions of things to try, as I am out of ideas.

As an aside and a lark, I increased the number of threads to 10 in the client (load generator), in the Akka FJE, and in the number of actors connected to mongo. For the legacy java driver, the results were these:
[OVERALL], RunTime(ms), 196111.0
[OVERALL], Throughput(ops/sec), 5099.153030681604
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 1939.049101
while the RxMongo driver performed like this:
[OVERALL], RunTime(ms), 232262.0
[OVERALL], Throughput(ops/sec), 4305.482601544807
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 2262.050283

I note from these results that:
  • The discrepancy in latency narrowed from 57% to 16%
  • The throughput for the legacy driver declined slightly (1.2%) while the throughput for RxMongo increased significantly (31.4%).

This gives me some hope that with correct tuning for a workload, and at higher levels of load, an Akka-based mongo driver might just perform well, even if it suffers in corner cases like a single thread. Since my goal is to make this work on top of akka-streams, I am going to finish the RxMongo implementation while waiting for akka-streams, and then come back to comparing the performance of Akka IO vs. akka-streams vs. the legacy Java driver after akka-streams 1.0 is released. So, unless others have further input, this thread will go idle for a couple of months (probably a good thing!) :)

My thanks and best regards to everyone who helped with this analysis.

Reid.  