Bad scaling on benchmark comparison. Is there something wrong?


Rüdiger Möller

Jan 6, 2014, 7:26:11 PM
to akka...@googlegroups.com
Please checkout chart here.

https://plus.google.com/109956740682506252532/posts/1hKcYyPuJzh

[cut & pasted from G+]:
Hey folks, I am currently writing a blog post benchmarking Akka vs. traditional threading. I use the example provided by the Akka Java tutorial computing Pi. In order to compare the ability to parallelize large numbers of tiny jobs, I use Pi-computation slices of 100,000 jobs with 1,000 iterations each.
Hardware is a dual-socket AMD Opteron with 8 real and 8 'virtual' cores per socket (because the test uses floating point, I only scale to 16 threads instead of 32).
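[For reference, a self-contained sketch of the kind of per-job work involved: one slice of the Gregory-Leibniz series for Pi, as in the Typesafe tutorial. Class and method names here are illustrative, not necessarily the tutorial's.]

```java
// One "job" from the benchmark: sum one nrOfElements-long slice of the
// Gregory-Leibniz series 4 - 4/3 + 4/5 - 4/7 + ...
// Names are illustrative, not necessarily the tutorial's.
public class PiSlice {
    static double calculatePiFor(int slice, int nrOfElements) {
        double acc = 0.0;
        for (int i = slice * nrOfElements; i < (slice + 1) * nrOfElements; i++) {
            acc += 4.0 * (1 - (i % 2) * 2) / (2 * i + 1);
        }
        return acc;
    }

    public static void main(String[] args) {
        // 100,000 jobs x 1,000 iterations each, as in the benchmark setup
        double pi = 0.0;
        for (int slice = 0; slice < 100_000; slice++) {
            pi += calculatePiFor(slice, 1000);
        }
        System.out.println(pi); // converges toward 3.14159...
    }
}
```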

As you can see in the chart, Akka (2.0.3) performs very badly compared to threads and a homebrew actor lib.

Source of the Akka bench is here: https://gist.github.com/RuedigerMoeller/8272966
(I added an outer loop to the original Typesafe sample.)

Is there anything I'm missing, or is this 'normal' Akka performance?

Threading-style code is here: https://gist.github.com/RuedigerMoeller/8273307

I tried 2.1 with even worse results. 

http://imgur.com/TAt9XOf

Endre Varga

Jan 7, 2014, 5:49:49 AM
to akka...@googlegroups.com
Hi Rüdiger,

Have you tried playing around with the throughput setting of the dispatcher? For this kind of non-interactive job, fairness is not an issue, so you most likely want to increase that value.
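[For anyone following along, a sketch of what such a tuned dispatcher could look like in application.conf; the dispatcher name and sizing values here are illustrative guesses, not recommendations:]

```hocon
# Illustrative custom dispatcher: raise throughput (the number of messages an
# actor may process before its thread moves on to the next actor) well above
# the default of 5, trading fairness for batch throughput.
pi-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-factor = 1.0
    parallelism-max = 16
  }
  throughput = 1000
}
```

The worker actors would then be attached to it via Props' withDispatcher("pi-dispatcher") (Akka 2.x classic API).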

-Endre


--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

√iktor Ҡlang

Jan 7, 2014, 5:57:54 AM
to Akka User List
You definitely want to play around with the configuration and make sure that you are benchmarking correctly: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java

Things to tune: dispatcher throughput, thread pool type and size, and mailbox type.

Also, Opterons have pretty bad cache performance for inter-core comms (Intel uses inclusive L3s for faster on-package caches)

Cheers,

--

Viktor Klang

Director of Engineering

Twitter: @viktorklang

hschoeneberg

Jan 7, 2014, 6:49:08 AM
to akka...@googlegroups.com
Hey Rüdiger,

you should probably reconsider your application's design:

In your calculate() method you keep creating ActorSystems and spawning an actor responsible for the actual computation. You should prefer creating and keeping one ActorSystem for your benchmark and just spawning an actor to handle each calculation request (i.e. your master). After completing the calculation you let that actor die. This alone would probably boost the performance. Also: there is no 1:1 correlation between actors and threads; you can create far more actors which, depending on the dispatcher, share a given thread pool for execution.
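[The create-once-and-reuse principle, illustrated with a plain ExecutorService rather than Akka API; this is an analogy only, and all names are mine:]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Analogy for "one ActorSystem, many short-lived actors": build the
// expensive long-lived container once, submit only cheap work per run.
public class ReuseDemo {
    static long runBenchmark(int runs) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // created ONCE
        long total = 0;
        try {
            for (int run = 0; run < runs; run++) {
                // per run: cheap work units only, no pool/system rebuild
                total += pool.submit(() -> 1L).get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(runBenchmark(30)); // prints 30
    }
}
```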

Kind regards,
Hendrik

Endre Varga

Jan 7, 2014, 6:57:40 AM
to akka...@googlegroups.com


In your calculate()-method you keep creating ActorSystems and spawn an actor responsible for the actual computation.

Ah, good observation, we missed that one.
 


Rüdiger Möller

Jan 7, 2014, 7:16:20 AM
to akka...@googlegroups.com
1) The ActorSystem creation is in the outer loop (VM warmup); your claim is not true. Re-using the actor system does not change the results, as each Pi computation task takes >500 ms, so actor system creation is negligible (<1 ms). Note that one iteration refers to a complete concurrent multi-slice Pi calculation. I am looping only to get correct VM warmup and to measure the average duration of the last 10 runs.

2) I can see that Akka makes use of all CPUs. From a user's point of view, the benchmark is about handling many small computation jobs (Pi-computation slices) concurrently, with contention created when assembling the result. This is the original Akka sample on how to do this.

Rüdiger Möller

Jan 7, 2014, 7:16:57 AM
to akka...@googlegroups.com
Bad observation, as the program was not read correctly ..

Rüdiger Möller

Jan 7, 2014, 7:32:50 AM
to akka...@googlegroups.com


On Tuesday, January 7, 2014, 11:57:54 AM UTC+1, √ wrote:
You definitely want to play around with the configuration and make sure that you are benchmarking correctly: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java


I know the details regarding VM warmup and how to avoid in-place JIT'ing. I am iterating the bench 30 times and only take the average of the times of the last 10 iterations (20 iterations of warmup); see the main loop I added to the original Akka sample. Also, the benchmark is not in the nanoseconds range, but several hundred milliseconds per iteration.
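[The measurement scheme described (30 iterations, average of the last 10) can be sketched like this; the harness below is mine, not the benchmark's actual main loop:]

```java
// Warm-up scheme: run the full benchmark 30 times, treat the first 20
// iterations as JIT warm-up, and report the average of the last 10.
public class Warmup {
    static double averageOfLast(long[] timings, int n) {
        double sum = 0;
        for (int i = timings.length - n; i < timings.length; i++) {
            sum += timings[i];
        }
        return sum / n;
    }

    public static void main(String[] args) {
        long[] timings = new long[30];
        for (int iter = 0; iter < timings.length; iter++) {
            long start = System.nanoTime();
            // ... one complete concurrent multi-slice Pi calculation here ...
            timings[iter] = (System.nanoTime() - start) / 1_000_000L;
        }
        System.out.println("average : " + averageOfLast(timings, 10));
    }
}
```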
 
dispatcher throughput, thread pool type and thread pool size, and mailbox type.

Can you be more specific, please (code!)? The use case is many short-running computations with contention created when aggregating the result.
 

Also, Opterons have pretty bad cache performance for inter-core comms (Intel uses inclusive L3s for faster on-package caches)


Well, the other benches use the same hardware.
I am currently repeating the test on a dual-socket 6-core Intel Xeon (so 12 cores, 24 hardware threads), with Akka still being the worst one by far.
 

√iktor Ҡlang

Jan 7, 2014, 7:40:50 AM
to Akka User List
On Tue, Jan 7, 2014 at 1:32 PM, Rüdiger Möller <moru...@gmail.com> wrote:


On Tuesday, January 7, 2014, 11:57:54 AM UTC+1, √ wrote:
You definitely want to play around with the configuration and make sure that you are benchmarking correctly: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java


I know the details regarding VM warmup and how to avoid in-place JIT'ing. I am iterating the bench 30 times, only take the average times of last 10 iterations (20 iterations warmup), see mainloop i added to the original AKKA sample. Also the benchmark is not in the nanos, but several hundred millis per iteration.

There is no reason at all to use currentTimeMillis (it has accuracy problems; I've seen up to 20-30 ms). Just use nanoTime.
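[A minimal nanoTime-based harness of the kind being suggested; names are illustrative:]

```java
// Measure elapsed time with System.nanoTime(), which is meant for interval
// measurement; System.currentTimeMillis() tracks wall-clock time, which the
// OS may adjust (e.g. via NTP) in the middle of a measurement.
public class Timing {
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> {
            double acc = 0;
            for (int i = 0; i < 10_000_000; i++) acc += Math.sqrt(i);
        });
        System.out.println(ms + " ms");
    }
}
```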
 
 
dispatcher throughput, thread pool type and thread pool size, and mailbox type.

can you be more specific pls (Code!). The use case is many short running computations with contention created when aggregating the result.

 
 

Also, Opterons have pretty bad cache performance for inter-core comms (Intel uses inclusive L3s for faster on-package caches)


Well the other benches use the same hardware.
I am currently repeating the test on a dual socket x 6 core intel xeon (so 12 cores, 24Hardware-Threads). With Akka still being worst one by far.

Numbers?

(I also think we have to remain quite reasonable here; Akka lets you scale out your computation to up to ~2500 JVMs. Do the other solutions offer that?)

Cheers,

Patrik Nordwall

Jan 7, 2014, 8:23:06 AM
to akka...@googlegroups.com
As pointed out, I do think the creation and shutdown of the actor systems influences the results. I tried to run your code, and after changing it to use one actor system my results are:

average 1 threads : 614
average 2 threads : 309
average 3 threads : 250
average 4 threads : 190
average 5 threads : 159
average 6 threads : 141
average 7 threads : 162
average 8 threads : 157
average 9 threads : 156
average 10 threads : 145
average 11 threads : 140
average 12 threads : 134
average 13 threads : 133
average 14 threads : 131
average 15 threads : 138
average 16 threads : 136

Compared to PiThreaded:
average 1 threads : 495
average 2 threads : 263
average 3 threads : 179
average 4 threads : 146
average 5 threads : 137
average 6 threads : 128
average 7 threads : 122
average 8 threads : 122
average 9 threads : 122
average 10 threads : 123
average 11 threads : 122
average 12 threads : 123
average 13 threads : 123
average 14 threads : 122
average 15 threads : 124
average 16 threads : 125

Cheers,
Patrik



--

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Rüdiger Möller

Jan 7, 2014, 8:27:43 AM
to akka...@googlegroups.com


On Tuesday, January 7, 2014, 1:40:50 PM UTC+1, √ wrote:



On Tue, Jan 7, 2014 at 1:32 PM, Rüdiger Möller <moru...@gmail.com> wrote:


On Tuesday, January 7, 2014, 11:57:54 AM UTC+1, √ wrote:
You definitely want to play around with the configuration and make sure that you are benchmarking correctly: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java


I know the details regarding VM warmup and how to avoid in-place JIT'ing. I am iterating the bench 30 times, only take the average times of last 10 iterations (20 iterations warmup), see mainloop i added to the original AKKA sample. Also the benchmark is not in the nanos, but several hundred millis per iteration.

There is no reason at all to use currentTimeMillis (it's got accuracy problems (I've seen up to 20-30ms), just use nanoTime.

There is a reason. System.currentTimeMillis is based on global system time; nanoTime is guaranteed to be consistent within a single thread only. The "accuracy problems" usually occur if one compares System.currentTimeMillis values obtained from different threads. Since runtimes are in the range of up to >1000 ms, and the tests are run many times, I can say for sure this is not the reason why Akka seems to scale not-that-well. Some historical problems with huge inaccuracy of currentTimeMillis were on Windows XP. I am using CentOS 6.4, 64-bit.
 
 
 
dispatcher throughput, thread pool type and thread pool size, and mailbox type.

can you be more specific pls (Code!). The use case is many short running computations with contention created when aggregating the result.

 
 

Also, Opterons have pretty bad cache performance for inter-core comms (Intel uses inclusive L3s for faster on-package caches)


Well the other benches use the same hardware.
I am currently repeating the test on a dual socket x 6 core intel xeon (so 12 cores, 24Hardware-Threads). With Akka still being worst one by far.

Numbers?

Opteron scaling
 http://imgur.com/TAt9XOf

Intel:

========================================== 1m jobs each perform 100-pi-slice loop

AKKA
average 1 threads : 1914
average 2 threads : 970
average 3 threads : 1032
average 4 threads : 1099
average 5 threads : 1343
average 6 threads : 1336
average 7 threads : 1470
average 8 threads : 1543
average 9 threads : 1788
average 10 threads : 1500
average 11 threads : 1509
average 12 threads : 1454

synced Threading
average 1 threads : 800
average 2 threads : 951
average 3 threads : 953
average 4 threads : 1087
average 5 threads : 1087
average 6 threads : 1041
average 7 threads : 1028
average 8 threads : 982
average 9 threads : 1046
average 10 threads : 1031
average 11 threads : 1038
average 12 threads : 1015

Abstraktor
average 1 threads : 1349
average 2 threads : 674
average 3 threads : 477
average 4 threads : 380
average 5 threads : 323
average 6 threads : 302
average 7 threads : 321
average 8 threads : 329
average 9 threads : 354
average 10 threads : 369
average 11 threads : 386
average 12 threads : 385

========================================== 100k jobs each perform 1000-pi-slice loop

Abstraktor
average 1 threads : 738
average 2 threads : 364
average 3 threads : 246
average 4 threads : 183
average 5 threads : 148
average 6 threads : 124
average 7 threads : 105
average 8 threads : 94
average 9 threads : 87
average 10 threads : 79
average 11 threads : 92
average 12 threads : 104

synced Threading
average 1 threads : 674
average 2 threads : 373
average 3 threads : 231
average 4 threads : 187
average 5 threads : 152
average 6 threads : 128
average 7 threads : 117
average 8 threads : 110
average 9 threads : 115
average 10 threads : 127
average 11 threads : 135
average 12 threads : 151

AKKA
average 1 threads : 772
average 2 threads : 378
average 3 threads : 295
average 4 threads : 238
average 5 threads : 201
average 6 threads : 162
average 7 threads : 139
average 8 threads : 128
average 9 threads : 118
average 10 threads : 114
average 11 threads : 128
average 12 threads : 118


As you can see, when reducing the number of messages (and increasing the duration of a single job), Akka's performance improves.
This indicates that Akka has a pretty slow message passing/queuing implementation (see also the difference in the single-threaded runs in the 1-million-messages case).
The bad scaling behaviour (well, it's at least better than synchronized threads in some cases) indicates there is serious contention somewhere in the central dispatching loop.
You can see in the 1m-messages example that multithreading does not scale at all (due to contention), but still performs better than Akka in this scenario. So basically a single-threaded solution would be faster than both multithreading and Akka. Only Abstraktor improves up to 6 threads; after that it also stalls due to contention.
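[A minimal sketch of the kind of aggregation contention being described, not the benchmark's actual code: many threads hammering one shared accumulator serialize on that accumulator, while per-thread accumulation combines only once at the end.]

```java
import java.util.concurrent.atomic.AtomicLong;

// Contrasts a contended shared accumulator with per-thread accumulation.
// Both produce the same total; the shared version serializes every update.
public class ContentionDemo {
    static long sharedAccumulate(int threads, int perThread) {
        AtomicLong shared = new AtomicLong(); // every increment contends
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) shared.incrementAndGet();
            });
            ts[t].start();
        }
        joinAll(ts);
        return shared.get();
    }

    static long localThenCombine(int threads, int perThread) {
        long[] partial = new long[threads]; // one slot per thread
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                long acc = 0;                      // thread-local, no contention
                for (int i = 0; i < perThread; i++) acc++;
                partial[id] = acc;                 // one write at the end
            });
            ts[t].start();
        }
        joinAll(ts);
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    static void joinAll(Thread[] ts) {
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sharedAccumulate(4, 1_000_000));  // 4000000
        System.out.println(localThenCombine(4, 1_000_000));  // 4000000
    }
}
```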


(I also think we have to remain quite reasonable here, akka lets you scale out your computation up to ~2500 Jvms. Does the other solutions offer that?)

Network-connected VMs :-). The computation is just a placeholder for the use case of "high rates of small events", which is typical of many real-time systems. Scaling out frequently does not pay off because the network (+ encoding/decoding) becomes the bottleneck. Scaling is not about saturating many CPUs but about getting more throughput ;-)
 

Rüdiger Möller

Jan 7, 2014, 8:37:01 AM
to akka...@googlegroups.com
Thanks for the assistance; can you please cut and paste your code here :-) ?

Hm, but I can't see that much difference from the original (except that my machine is somewhat slower overall) ...

"
========================================== 100k jobs each perform 1000-pi-slice loop
AKKA
average 1 threads : 772
average 2 threads : 378
average 3 threads : 295
average 4 threads : 238
average 5 threads : 201
average 6 threads : 162
average 7 threads : 139
average 8 threads : 128
average 9 threads : 118
average 10 threads : 114
average 11 threads : 128
average 12 threads : 118

synced Threading
average 1 threads : 674
average 2 threads : 373
average 3 threads : 231
average 4 threads : 187
average 5 threads : 152
average 6 threads : 128

average 7 threads : 117
average 8 threads : 110
average 9 threads : 115
average 10 threads : 127
average 11 threads : 135
average 12 threads : 151
"

Can you run your sample with 100 iterations and 1m jobs ... ?

"
        int numStepsPerComp = 100;
        int numJobs = 1000000;
"

Endre Varga

Jan 7, 2014, 8:37:21 AM
to akka...@googlegroups.com
Well, apart from other side effects, creating an ActorSystem at least results in an attempt to load and assemble the configuration, so it is preferable to reuse the same system between runs.

Rüdiger Möller

Jan 7, 2014, 8:41:37 AM
to akka...@googlegroups.com
I'll incorporate that if you paste the source. However, I cannot see anything like a "boost" compared to the original version, and that is quite logical: the test runs up to 1 million jobs, each performing a 100-iteration computing loop, so their duration dwarfs the overhead of initializing an actor system .. but anyway, maybe it helps; at least you are the Akka pros :-)

√iktor Ҡlang

Jan 7, 2014, 8:46:22 AM
to Akka User List
On Tue, Jan 7, 2014 at 2:27 PM, Rüdiger Möller <moru...@gmail.com> wrote:


On Tuesday, January 7, 2014, 1:40:50 PM UTC+1, √ wrote:



On Tue, Jan 7, 2014 at 1:32 PM, Rüdiger Möller <moru...@gmail.com> wrote:


On Tuesday, January 7, 2014, 11:57:54 AM UTC+1, √ wrote:
You definitely want to play around with the configuration and make sure that you are benchmarking correctly: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java


I know the details regarding VM warmup and how to avoid in-place JIT'ing. I am iterating the bench 30 times, only take the average times of last 10 iterations (20 iterations warmup), see mainloop i added to the original AKKA sample. Also the benchmark is not in the nanos, but several hundred millis per iteration.

There is no reason at all to use currentTimeMillis (it's got accuracy problems (I've seen up to 20-30ms), just use nanoTime.

There is reason. System time millis is based on global system time, nanos is guaranteed to be consistent within a single thread only. The "accuracy problems" usually occur if one compares System.currentTimeMillis obtained from within different threads. Since runtimes are in the range of up to >1000 ms, and the tests are run many times i can say for sure this is not the reason why akka seems to scale not-that-good. Some historical problems with huge inaccuracy of systimemillis were with windows XP. I am using CenOS 6.4, 64bit.

nanoTime is supposedly monotonic, where is your reference to the "same thread" claim?
 
 
 
 
dispatcher throughput, thread pool type and thread pool size, and mailbox type.

can you be more specific pls (Code!). The use case is many short running computations with contention created when aggregating the result.


Did you apply this?
What configuration have you been using?
 


(I also think we have to remain quite reasonable here, akka lets you scale out your computation up to ~2500 Jvms. Does the other solutions offer that?)

Network connected VM's :-).

?
 
The computation is just a placeholder for the use case of "high rates of small events" which is typical for many real time systems. Scaling frequently does not pay-off because network (+decoding/encoding) becomes the bottleneck. Scaling is not about saturating many CPU's but about getting more throughput ;-)

Single-machine performance is only interesting if you are after single points of failure.

Cheers,

Patrik Nordwall

Jan 7, 2014, 8:58:12 AM
to akka...@googlegroups.com
Sorry, already thrown away. I'm sure you can rewrite and test it yourself.
/Patrik

Rüdiger Möller

Jan 7, 2014, 10:22:21 AM
to akka...@googlegroups.com

nanoTime is supposedly monotonic, where is your reference to the "same thread" claim?

It's guaranteed to be monotonic as seen from a single thread, not across threads. currentTimeMillis is guaranteed to be monotonic across threads, so it's more expensive and requires some fencing etc. to be generated by HotSpot. There is a video by Cliff Click out there where he goes into this in great detail ..
Anyway, the results are not skewed by that, for sure.


No, but I will try. I am not interested in presenting skewed benchmarks. Abstraktor is not a competing project; it's just my playground lean actor implementation, to get a raw feeling for what should be possible.
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single-machine performance AND remote messaging throughput + latency. Regarding remoting/failover, there are much faster options than actors/Akka today.
I appreciate your vision of making this transparent to the application. It's a great idea, but I think you are still not there for the very high-end kind of application, no offence. I have built large high-performance distributed systems, so I know what I am talking about.
However, regarding concurrent programming, actors can improve performance and maintainability today; that's why I am currently investigating/benchmarking local performance only.
I will incorporate your proposals into the test.

regards,
Rüdiger

√iktor Ҡlang

Jan 7, 2014, 10:32:48 AM
to Akka User List
On Tue, Jan 7, 2014 at 4:22 PM, Rüdiger Möller <moru...@gmail.com> wrote:

nanoTime is supposedly monotonic, where is your reference to the "same thread" claim?

Its guaranteed to be monotonic seen from a single thread, not across threads. systime millis is guaranteed to be monotonic across threads, so its more expensive and requires some fencing etc. to be generated by hotspot. There is a video of Cliff Click out there where he goes into that in great detail ..
Anyway, the results are not skewed by that for sure.

I'm not sure a hand-wavy reference to Cliff is going to quench my thirst for facts even though Cliff is a great guy.
 


No, but i will try. I am not interested in presenting skewed benchmarks. Abstraktor is not a competing project, its just my playground lean actor impl to get a raw feeling of what should be possible.

The default settings are definitely not optimal for your use-case (as default settings rarely are).
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remoting you have a SPOF.
 
Regarding remoting/failover there are much faster options than actors/Akka today.

Reference?
 
I appreciate your vision of making this transparent to the application. Its a great idea, but I think your are still not there for the very high end kind of application, no offence. I have built large high performance distributed systems, so I know what I am talking bout.

What are you basing this opinion on, and what benchmark/setting are you comparing?
 
However regarding concurrent programming, actors can improve performance and maintainability today, that's why i am currently investigating/benchmarking local performance only.
I'll will incorporate your proposals into the test.

Great, let us know what the results were.

Cheers,
 

regards,
Rüdiger


Endre Varga

Jan 7, 2014, 11:22:04 AM
to akka...@googlegroups.com
On Tue, Jan 7, 2014 at 4:32 PM, √iktor Ҡlang <viktor...@gmail.com> wrote:



On Tue, Jan 7, 2014 at 4:22 PM, Rüdiger Möller <moru...@gmail.com> wrote:

nanoTime is supposedly monotonic, where is your reference to the "same thread" claim?

Its guaranteed to be monotonic seen from a single thread, not across threads. systime millis is guaranteed to be monotonic across threads, so its more expensive and requires some fencing etc. to be generated by hotspot. There is a video of Cliff Click out there where he goes into that in great detail ..
Anyway, the results are not skewed by that for sure.

I'm not sure a hand-wavy reference to Cliff is going to quench my thirst for facts even though Cliff is a great guy.

I googled around and I have not found any conclusive answer if you take old systems into account. There is this blog post (now more than 6 years old) (https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks):

"The default mechanism used by QPC is determined by the Hardware Abstraction layer(HAL), but some systems allow you to explicitly control it using options in boot.ini, such as /usepmtimer that explicitly requests use of the power management timer. This default changes not only across hardware but also across OS versions. For example Windows XP Service Pack 2 changed things to use the power management timer (PMTimer) rather than the processor timestamp-counter (TSC) due to problems with the TSC not being synchronized on different processors in SMP systems, and due the fact its frequency can vary (and hence its relationship to elapsed time) based on power-management settings. "

Current AMD Opteron(tm) and Athlon(tm)64 processors provide power management mechanisms that independently adjust the performance state ("P-state") and power state ("C-state") of the processor[1][2]; these state changes can affect a processor core's Time Stamp Counter (TSC) which some operating systems may use as a part of their time keeping algorithms. Most modern operating systems are well aware of the effect of these state changes on the TSC and the potential for TSC drift[3] across multiple processor cores and properly account for it.

I don't think these apply to current systems though (the above posts are really old), and I don't even see a definitive answer for old systems -- unfortunately Björn is not here to ask.

Anyway this has not much relevance for this discussion.
 
 


No, but i will try. I am not interested in presenting skewed benchmarks. Abstraktor is not a competing project, its just my playground lean actor impl to get a raw feeling of what should be possible.


Tuning the dispatcher is not skewing the benchmarks. The whole idea of dispatchers is that you can tune subsystems of your actor system to particular load characteristics. The default throughput setting hits a particular point in the fairness-throughput tradeoff spectrum, which is not the best for batch workloads.
 
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remote you have a spof.
 
Regarding remoting/failover there are much faster options than actors/Akka today. 

As for failover, if you are limited to a software implementation, the speed of remote failover is bounded by a timeout period; it does not matter what software framework (Akka or other) is used. If you have side-channel information, e.g. hardware solutions like link-failure notifications or hardware watchdogs, the game is different -- but that is apples to oranges.
 
 
I appreciate your vision of making this transparent to the application. Its a great idea, but I think your are still not there for the very high end kind of application, no offence. I have built large high performance distributed systems, so I know what I am talking bout.


It is a bit of a strawman. For any particular use-case, the fastest implementation is a custom hand-tuned one designed by an expert, and I don't doubt that you can beat Akka in many particular scenarios. In fact, for every system there is always one more benchmark that it cannot beat. It all depends on how many resources you can throw at your problem (and at maintaining it over time).
 
 
However regarding concurrent programming, actors can improve performance and maintainability today, that's why i am currently investigating/benchmarking local performance only.
I'll will incorporate your proposals into the test.


We could in theory play around with the example and fine-tune it (I am very tempted to try it now), but the problem is that we are preparing a release and cannot really allocate any time to this particular benchmark. Play around with the dispatcher settings a bit and see how it works out -- try tuning the throughput setting in particular.

-Endre

Rüdiger Möller

Jan 7, 2014, 11:26:53 AM
to akka...@googlegroups.com

I'm not sure a hand-wavy reference to Cliff is going to quench my thirst for facts even though Cliff is a great guy.

I googled that for you:

A JVM Does That? - YouTube

Again: your obsession with time measurement makes sense when measuring small numbers of ticks and adding them up, but not in the context of a longer-running test. You can easily copy the snippet and add nanoTime measurement. It will not make a significant difference.


 


No, but i will try. I am not interested in presenting skewed benchmarks. Abstraktor is not a competing project, its just my playground lean actor impl to get a raw feeling of what should be possible.

The default settings is definitely not optimal for your use-case (as default settings rarely are).

That's why I preferred talking back here :-)
 
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remote you have a spof.
 
Regarding remoting/failover there are much faster options than actors/Akka today.

Reference?

Thinking of IBM WLLM or Informatica UM. Does Akka offer reliable UDP messaging with several million msg/sec throughput? From what I have seen, it's challenging enough to get this even on localhost. I have used the former; it's really blazing fast (at least with kernel-bypass networking equipment).
 
 
I appreciate your vision of making this transparent to the application. Its a great idea, but I think your are still not there for the very high end kind of application, no offence. I have built large high performance distributed systems, so I know what I am talking bout.

What are you basing this opinion on, and what benchmark/setting are you comparing?

Benchmarks and some public bragging with not-so-impressive numbers .. ;-). Also, I can see in single-threaded benchmarks that Akka's message passing adds significant overhead compared to a single-threaded executor and the bytecode-weaving-based proxying used in other libs. One can even do remote messaging faster than Akka does inter-thread messaging, so there definitely is room for improvement. Does Akka provide reliable UDP messaging (NAK-based, not acknowledged)? That's what you need for high-end throughput + failover, IMO. Doing typed-actor message passing via JDK proxies, e.g., is, well .. you should know yourself :-)
 
 
However regarding concurrent programming, actors can improve performance and maintainability today, that's why i am currently investigating/benchmarking local performance only.
I'll will incorporate your proposals into the test.

Great, let us know what the results were.

Don't expect too much; as I pointed out, from my POV the problem is already inside Akka's basic message passing performance, so Akka has a hard time breaking even when scaling. We'll see.
 

√iktor Ҡlang

Jan 7, 2014, 11:39:30 AM
to Akka User List
On Tue, Jan 7, 2014 at 5:26 PM, Rüdiger Möller <moru...@gmail.com> wrote:

I'm not sure a hand-wavy reference to Cliff is going to quench my thirst for facts even though Cliff is a great guy.

I googled that for you:

A JVM Does That? - YouTube

I think you missed my point. Which was that there is no, to my knowledge, specification indicating that nanoTime shouldn't behave monotonically and non-monotonicity should be considered to be a bug.
 

Again: your obsession with time measurement makes sense when measuring small numbers of ticks and adding them up, but not in the context of a longer-running test. You may easily copy the snippet and add nanoTime measurement. It will not make a significant difference.

No, my point is that using currentTimeMillis to obtain durations _at all_ is to be considered bad practice due to the shoddy accuracy.
 


 


No, but I will try. I am not interested in presenting skewed benchmarks. Abstraktor is not a competing project; it's just my playground lean actor impl to get a raw feeling for what should be possible.

The default settings are definitely not optimal for your use-case (as default settings rarely are).

That's why I preferred talking back here :-)

:)
 
 
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remote you have a spof.
 
Regarding remoting/failover there are much faster options than actors/Akka today.

Reference?

I'm thinking of IBM WLLM or Informatica UM. Does Akka offer reliable UDP messaging with several million msg/sec throughput? From what I have seen it's challenging enough to get this on localhost. I have used the former; it's really blazingly fast (at least with kernel-bypass networking equipment).

You're comparing apples to oranges, i.e. a transport with a model of computation.
Akka's remoting transport is pluggable, so you could implement a UM version of it if you so wish. Or even that WLLM!
 
 
 
I appreciate your vision of making this transparent to the application. It's a great idea, but I think you are still not there for the very high-end kind of application, no offence. I have built large high-performance distributed systems, so I know what I am talking about.

What are you basing this opinion on, and what benchmark/setting are you comparing?

Benchmarks and some public bragging with not-so-impressive numbers .. ;-). Also, I can see in single-threaded benchmarks that Akka's message passing adds significant overhead compared to a single-threaded executor and the byte-weaving-based proxying used in other libs.

Reference?
 
One can even do remote messaging faster than Akka does inter-thread messaging, so there is definitely room for improvement.

Akka Remote Transport is as I said, pluggable.
 
Does Akka provide reliable UDP messaging (NAK-based, not acknowledged)?

Akka Remote Transport is ...
 
That's what you need for high-end throughput+failover IMO. Doing typed-actor message passing via JDK proxies is, well .. you should know yourself :-)

Absolutely, TypedActors used to be based on AspectWerkz proxies but were repurposed to use JDK Proxies due to the use-case. You are of course, if you want, free to use a JVM that ships with extreme performance JDK Proxies. :)
 
 
 
However, regarding concurrent programming, actors can improve performance and maintainability today; that's why I am currently investigating/benchmarking local performance only.
I will incorporate your proposals into the test.

Great, let us know what the results were.

Don't expect too much; as I pointed out, from my POV the problem is already in Akka's basic message-passing performance, so Akka has a hard time breaking even when scaling. We'll see.

I've seen differences of up to four orders of magnitude just with configuration changes. As you can imagine, I have spent quite some time tuning Akka.

Cheers,

Alec Zorab

unread,
Jan 7, 2014, 11:40:44 AM1/7/14
to akka...@googlegroups.com
I can't see your imgur results from the office, but when I run your two gists on my workstation I get results like this - do yours differ significantly from mine?

akka:
Pi approximation: 3.1415926435898274 Calculation time: 244
average 1 threads : 767
average 2 threads : 396
average 3 threads : 303
average 4 threads : 259
average 5 threads : 227
average 6 threads : 205
average 7 threads : 188
average 8 threads : 187
average 9 threads : 222
average 10 threads : 218
average 11 threads : 210
average 12 threads : 208
average 13 threads : 205
average 14 threads : 208
average 15 threads : 209
average 16 threads : 205

ThreadPi:
average 1 threads : 720
average 2 threads : 383
average 3 threads : 278
average 4 threads : 210
average 5 threads : 198
average 6 threads : 189
average 7 threads : 184
average 8 threads : 167
average 9 threads : 186
average 10 threads : 184
average 11 threads : 186
average 12 threads : 183
average 13 threads : 189
average 14 threads : 185
average 15 threads : 186
average 16 threads : 187


Rüdiger Möller

unread,
Jan 7, 2014, 11:44:08 AM1/7/14
to akka...@googlegroups.com


Current AMD Opteron(tm) and Athlon(tm)64 processors provide power management mechanisms that independently adjust the performance state ("P-state") and power state ("C-state") of the processor[1][2]; these state changes can affect a processor core's Time Stamp Counter (TSC) which some operating systems may use as a part of their time keeping algorithms. Most modern operating systems are well aware of the effect of these state changes on the TSC and the potential for TSC drift[3] across multiple processor cores and properly account for it.

I don't think these apply to current systems though (the above posts are really old), and I don't even see a definitive answer for old systems -- unfortunately Björn is not here to ask him.


All my machines have power saving turned off. The Intel boxes run with hyper-threading disabled ..

 
Tuning the dispatcher is not skewing the benchmarks. The whole idea of dispatchers is that you can tune subsystems of your actor system to particular load characteristics. The default throughput setting hits a particular point in the fairness-throughput tradeoff spectrum, which is not the best for batch workloads.

I agree. I just wanted to state that I am not interested in presenting "Bad Akka", as some of the comments looked like you felt offended ;-).
 
 
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remote you have a spof.
 
Regarding remoting/failover there are much faster options than actors/Akka today. 

As for failover, if you are limited to a software implementation, the speed of remote failover is bounded by a timeout period; it does not matter what software framework (Akka or other) is used. If you have any side-channel information, maybe hardware solutions, e.g. link-failure notifications or hardware watchdogs, the game is different -- but that is apples to oranges.

Disagree. You can run systems redundantly with total message ordering and always get the fastest response. This is zero-latency failover. Needs a decent reliable UDP messaging stack, of course.
 
 
 
I appreciate your vision of making this transparent to the application. Its a great idea, but I think your are still not there for the very high end kind of application, no offence. I have built large high performance distributed systems, so I know what I am talking bout.


It is a bit of a strawman. For any particular use-case, the fastest implementation is a custom hand-tuned one designed by an expert, and I don't doubt that you can beat Akka in many particular scenarios. In fact, for every system there is always one more benchmark that you cannot beat. It all depends on how many resources you have to throw at your problem (and at maintaining it over time).
 

Mostly agree. However, there's no excuse for not using the fastest possible option in basic mechanics like queued message dispatch. I have a reasonable suspicion that this is the case (I will have to investigate).
 
 
However regarding concurrent programming, actors can improve performance and maintainability today, that's why i am currently investigating/benchmarking local performance only.
I'll will incorporate your proposals into the test.


We could in theory play around with the example and fine-tune (I am very tempted to try it now), but the problem is that we are preparing a release and we cannot really allocate any time to this particular benchmark. Play around with the dispatcher settings a bit and see how it works out -- try tuning the throughput setting in particular. 


As long as the benchmark processes 1 million independent Pi computation slices concurrently, any tuning would be fair (and welcome). I am not so sure regarding "batching" optimizations, as that actually reduces the number of messages processed. However, an adaptive batching dispatcher could boost a lot (I know this from my network-related work), but at the cost of increased latency. This test is not about batching but about processing many tiny units of work, e.g. market data ;-)

regards,
rüdiger
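For reference, the throughput setting discussed above is a per-dispatcher configuration value in Akka 2.x. A minimal sketch of what such tuning looks like in application.conf (the value 100 is purely illustrative, not a recommendation):

```hocon
# application.conf -- illustrative values only
akka.actor.default-dispatcher {
  # How many messages an actor may process per scheduling slot.
  # The small default favours fairness; larger values favour
  # throughput for batch-style workloads like this benchmark.
  throughput = 100
}
```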

Rüdiger Möller

unread,
Jan 7, 2014, 11:47:39 AM1/7/14
to akka...@googlegroups.com
Yes, I posted the 100-iterations run with 1 million messages on imgur. The samples are with 1000 iterations per job and 100k jobs. Unfortunately I cannot access Google Docs from work .. I have the same chart for the 1000/100k version you are running.

Just change the constants at the top/bottom of the snippets.

√iktor Ҡlang

unread,
Jan 7, 2014, 11:52:28 AM1/7/14
to Akka User List
On Tue, Jan 7, 2014 at 5:44 PM, Rüdiger Möller <moru...@gmail.com> wrote:


Current AMD Opteron(tm) and Athlon(tm)64 processors provide power management mechanisms that independently adjust the performance state ("P-state") and power state ("C-state") of the processor[1][2]; these state changes can affect a processor core's Time Stamp Counter (TSC) which some operating systems may use as a part of their time keeping algorithms. Most modern operating systems are well aware of the effect of these state changes on the TSC and the potential for TSC drift[3] across multiple processor cores and properly account for it.

I don't think these apply to current systems though (the above posts are really old), and I don't even see a definitive answer for old systems -- unfortunately Björn is not here to ask him.


All my machines have power saving turned off. The Intel boxes run with hyper-threading disabled ..

 
Tuning the dispatcher is not skewing the benchmarks. The whole idea of dispatchers is that you can tune subsystems of your actor system to particular load characteristics. The default throughput setting hits a particular point in the fairness-throughput tradeoff spectrum, which is not the best for batch workloads.

I agree. I just wanted to state that I am not interested in presenting "Bad Akka", as some of the comments looked like you felt offended ;-).

Sounds just like a minor comms mishap.
 
 
 
 
 
 | Single-machine performance is only interesting if you are after single points of failure.

Both things are important: single machine performance AND remote messaging throughput + latency.

Yep, my argument was that without remote you have a spof.
 
Regarding remoting/failover there are much faster options than actors/Akka today. 

As for failover, if you are limited to a software implementation, the speed of remote failover is bounded by a timeout period; it does not matter what software framework (Akka or other) is used. If you have any side-channel information, maybe hardware solutions, e.g. link-failure notifications or hardware watchdogs, the game is different -- but that is apples to oranges.

Disagree. You can run systems redundantly with total message ordering and always get the fastest response. This is zero-latency failover. Needs a decent reliable UDP messaging stack, of course.

How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over; death and delay are indistinguishable in distributed systems)
 
 
 
 
I appreciate your vision of making this transparent to the application. Its a great idea, but I think your are still not there for the very high end kind of application, no offence. I have built large high performance distributed systems, so I know what I am talking bout.


It is a bit of a strawman. For any particular use-case, the fastest implementation is a custom hand-tuned one designed by an expert, and I don't doubt that you can beat Akka in many particular scenarios. In fact, for every system there is always one more benchmark that you cannot beat. It all depends on how many resources you have to throw at your problem (and at maintaining it over time).
 

Mostly agree. However, there's no excuse for not using the fastest possible option in basic mechanics like queued message dispatch. I have a reasonable suspicion that this is the case (I will have to investigate).

I'm not sure I follow. Clearly there's a tradeoff between fairness and throughput due to platform artifacts.

Cheers,
 
 
 
However regarding concurrent programming, actors can improve performance and maintainability today, that's why i am currently investigating/benchmarking local performance only.
I'll will incorporate your proposals into the test.


We could in theory play around with the example and fine-tune (I am very tempted to try it now), but the problem is that we are preparing a release and we cannot really allocate any time to this particular benchmark. Play around with the dispatcher settings a bit and see how it works out -- try tuning the throughput setting in particular. 


As long as the benchmark processes 1 million independent Pi computation slices concurrently, any tuning would be fair (and welcome). I am not so sure regarding "batching" optimizations, as that actually reduces the number of messages processed. However, an adaptive batching dispatcher could boost a lot (I know this from my network-related work), but at the cost of increased latency. This test is not about batching but about processing many tiny units of work, e.g. market data ;-)

regards,
rüdiger

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

Rüdiger Möller

unread,
Jan 7, 2014, 12:54:44 PM1/7/14
to akka...@googlegroups.com


On Tuesday, 7 January 2014 17:52:28 UTC+1, √ wrote:


Sounds just like a minor comms mishap.

Sorry, I cannot figure out what this means .. I am a native German.
 
 
How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over; death and delay are indistinguishable in distributed systems)
 

Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 


I'm not sure I follow. Clearly there's a tradeoff between fairness and throughput due to platform artifacts.


It's more basic: using proxies for typed actors is really wasteful. Untyped actors, on the other hand, replace message dispatch with 'instanceof' chaining, which prevents any HotSpot call optimization in the case of direct calls (both actors share the same thread => a direct method call is done instead of queuing). Is Akka doing direct dispatch in the case of typed actors on the same dispatcher thread (if not: thank god my bench is not covering this ;-))?
 
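The 'instanceof chaining' dispatch style being criticised can be sketched as follows (illustrative only, not Akka's actual implementation; the message classes are made up). Each message type is tested in turn, so the call site the JIT sees is a chain of type checks rather than a directly inlinable virtual call:

```java
// Sketch of instanceof-chained message dispatch (hypothetical classes).
public class InstanceofDispatch {
    static final class Work { final int n; Work(int n) { this.n = n; } }
    static final class Stop {}

    static int processed = 0;

    // The "receive" tests each message type in turn -- the dispatch
    // the thread is discussing.
    static void receive(Object msg) {
        if (msg instanceof Work) {
            processed += ((Work) msg).n;          // handle one message type
        } else if (msg instanceof Stop) {
            System.out.println("processed = " + processed);
        } else {
            throw new IllegalArgumentException("unhandled: " + msg);
        }
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) receive(new Work(i));
        receive(new Stop());                      // prints the running total
    }
}
```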

√iktor Ҡlang

unread,
Jan 7, 2014, 1:10:25 PM1/7/14
to Akka User List


On Jan 7, 2014 6:54 PM, "Rüdiger Möller" <moru...@gmail.com> wrote:
>
>
>
> On Tuesday, 7 January 2014 17:52:28 UTC+1, √ wrote:
>>
>>
>>
>> Sounds just like a minor comms mishap.
>
>
> Sorry, I cannot figure out what this means .. I am a native German.

I am native Swedish; you misunderstood each other.

>  
>>
>>  
>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over; death and delay are indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that work-stealing is OK, which is a subset of cases.

>
>>
>> I'm not sure I follow. Clearly there's a tradeoff between fairness and throughput due to platform artifacts.
>>
>
> It's more basic: using proxies for typed actors is really wasteful.

Provably untrue; you MUST have a logical proxy since it is a distributed model. The wastefulness of said proxy is implementation-dependent, and as such you cannot make any claim about the efficiency of an unstated implementation, or in general.

Untyped actors, on the other hand, replace message dispatch with 'instanceof' chaining, which prevents any HotSpot call optimization in the case of direct calls (both actors share the same thread => a direct method call is done instead of queuing).

Which is what you want since otherwise you're synchronous, i.e. a malicious or broken recipient can prevent progress of the sender's logic leading to extremely brittle systems. See http://blog.ometer.com/2011/07/24/callbacks-synchronous-and-asynchronous

Is Akka doing direct dispatch in the case of typed actors on the same dispatcher thread (if not: thank god my bench is not covering this ;-))?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Cheers,
V

√iktor Ҡlang

unread,
Jan 7, 2014, 1:12:58 PM1/7/14
to Akka User List

Now; you asked for ways of improving the Akka actor performance, we have provided the relevant information for you to do so.
Let's stay on topic.

Cheers,
V

Rüdiger Möller

unread,
Jan 7, 2014, 1:27:26 PM1/7/14
to akka...@googlegroups.com
No, my point is that using currentTimeMillis to obtain durations _at all_ is to be considered bad practice due to the shoddy accuracy.

System.currentTimeMillis() is wall-clock time. When measuring durations > 500 ms, accuracy issues are not a problem. Modern OSes + VMs have better accuracy than older ones. I'll change to nanos just in case.
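A minimal sketch of the switch being discussed: System.nanoTime() is monotonic within a JVM, so it is safe for durations, whereas System.currentTimeMillis() follows the wall clock and can jump if the clock is adjusted:

```java
import java.util.concurrent.TimeUnit;

public class Elapsed {
    public static void main(String[] args) throws InterruptedException {
        // nanoTime() is monotonic within a JVM; currentTimeMillis() follows
        // the wall clock and can move backwards on clock adjustment (NTP).
        long start = System.nanoTime();
        Thread.sleep(100); // stand-in for the benchmarked work
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```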
 
 

You're comparing apples to oranges, i.e. a transport with a model of computation.
Akkas remoting transport is pluggable so you could implement an UM version of it if you so wish. Or even that WLLM!

Ok, wasn't aware of that pluggability feature. Good.
 
Reference?

From the post above (Intel Xeon, 2 sockets x 6 cores) ..

========================================== 1m jobs each perform 100-pi-slice loop
AKKA
average 1 threads : 1914
Abstraktor
average 1 threads : 1349
synced Threading
average 1 threads : 800

"Sync'ed threading" schedules runnables to an executor, which obviously is fastest.
The Abstraktor prototype pushes method calls onto a ConcurrentLinkedQueue and executes them via reflection. This already produces significant overhead.
Akka has an overhead of >2 times the single-threaded case.

This would not be a problem if they scaled infinitely (threading does not scale at all in the 1 million message case). But they don't, because the queues passing inter-thread messages create contention (to a lesser extent compared to threading). Both Abstraktor and Akka stop scaling at a certain number of CPU cores.

So if the basic overhead is too high, the break-even never comes! Even worse (I don't know the exact reason): the default Akka queues seem to produce more contention than my prototype-ish plain polled CLQ. So Akka comes with the highest dispatch overhead and scales out worst due to contention: double fail. You should do something about that. Fast message passing is at the core of the system; it's not a good idea to relax regarding efficiency in such a critical part of your system.
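For reference, the "plain polled CLQ" dispatch described here can be sketched like this (a hypothetical minimal mailbox, not Abstraktor's actual code): senders enqueue messages onto a ConcurrentLinkedQueue, and a single worker thread drains it, so the actor's state stays single-threaded:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ClqMailbox {
    // Senders enqueue; exactly one worker thread drains, so the actor
    // state below is only ever touched by that single thread.
    private final ConcurrentLinkedQueue<Runnable> mailbox = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;
    private int counter; // actor state

    void tell(Runnable msg) { mailbox.offer(msg); }

    void drain() {
        while (running) {
            Runnable msg = mailbox.poll();
            if (msg != null) msg.run();
            else Thread.yield(); // busy-poll; a real impl would park/signal
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ClqMailbox actor = new ClqMailbox();
        Thread worker = new Thread(actor::drain);
        worker.start();
        for (int i = 0; i < 1000; i++) actor.tell(() -> actor.counter++);
        actor.tell(() -> actor.running = false); // "poison pill" stops the worker
        worker.join();
        System.out.println("counter = " + actor.counter);
    }
}
```

Since the queue is FIFO and a single thread processes it, all 1000 increments are applied before the poison pill, with no locks around the counter.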

 
That's what you need for high-end throughput+failover IMO. Doing typed-actor message passing via JDK proxies is, well .. you should know yourself :-)

Absolutely, TypedActors used to be based on AspectWerkz proxies but were repurposed to use JDK Proxies due to the use-case. You are of course, if you want, free to use a JVM that ships with extreme performance JDK Proxies. :)

400 lines of bytecode weaving can fix that.
 
I've seen differences of up to four orders of magnitude just with configuration changes. As you can imagine, I have spent quite some time tuning Akka.


OK, I just have to stop posting in order to do the test now ... :-)
 
- rüdiger

Rüdiger Möller

unread,
Jan 7, 2014, 1:59:57 PM1/7/14
to akka...@googlegroups.com

>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over; death and delay are indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that work-stealing is OK, which is a subset of cases.

How is that work stealing? Consider N receivers in identical state responding to the same requests (multicast, so requests are not sent twice). All N receivers respond, but the requestor just takes the first response and ignores the others.
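The scheme described here, N identical replicas answering the same request with the caller taking whichever reply arrives first, can be sketched with plain JDK futures (hypothetical names; local threads with artificial latencies stand in for remote replicas):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FirstResponseWins {
    // Simulated replica: identical state, only the response latency differs.
    static String replica(String name, long latencyMs) {
        try { Thread.sleep(latencyMs); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "reply from " + name;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // "Multicast" the same request to all replicas...
        CompletableFuture<String> a = CompletableFuture.supplyAsync(() -> replica("A", 300), pool);
        CompletableFuture<String> b = CompletableFuture.supplyAsync(() -> replica("B", 10), pool);
        CompletableFuture<String> c = CompletableFuture.supplyAsync(() -> replica("C", 400), pool);
        // ...and take the first reply; the later, identical replies are ignored.
        Object first = CompletableFuture.anyOf(a, b, c).join();
        System.out.println(first);
        pool.shutdownNow();
    }
}
```

Note Viktor's caveat applies: this only works when the request is idempotent, since all replicas execute it.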

Provably untrue; You MUST have a logical proxy since it is a distributed model. The wastefulness of said proxy is implementation dependent and as such you cannot make any claim of efficiency of unstated implementation or in general.

I can make the claim that it is awfully slow on the only usable server Java VM on the market, on the most frequently used hardware platform :-). You can roll your own proxy implementation with reasonable effort.
 

Untyped actors on the other hand replace message dispatch with 'instancof' chaining, which prevents any hotspot call optimization in case of direct calls (both actors share same thread => direct method call done instead of queuing).

Which is what you want since otherwise you're synchronous, i.e. a malicious or broken recipient can prevent progress of the sender's logic leading to extremely brittle systems. See http://blog.ometer.com/2011/07/24/callbacks-synchronous-and-asynchronous

I think an actor framework should not support synchronous callbacks at all. You don't need them.
In contradiction to the blog post above, in Abstraktor callbacks do not arrive on a different thread but put a message on the actor's queue (except when sharing a thread with the caller).
 

Is Akka doing direct dispatch in case of typed actors on same dispatcher thread (if not: thanks god my bench is not covering this ;-)) ) ?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Uhh, that's a very academic point of view. The speed difference between a direct call and a message being queued is >1000 times. One can keep the contract on actors but optimize the dispatch in case they share the same thread/dispatcher.
As long as synchronous results are forbidden, this does not affect the functionality or behaviour of an actor.
Yes, it *may* happen that the receiver blocks due to ill behaviour. If the same ill actor gets messages queued, it will get a queue overflow in most cases anyway. I'd consider this a bug that needs a fix. The performance tradeoff is massive and forces coarse-grained actor design, which on the other hand creates harder-to-balance apps.

I see your reasons; for me this is a no-go for practical reasons.

Alec Zorab

unread,
Jan 7, 2014, 2:11:42 PM1/7/14
to akka...@googlegroups.com

Right, just so I'm clear - running your tests, I see something on the order of a 10% performance penalty for Akka vs your solution using all sorts of excitement with countdown latches and thread parking. Are you seeing a difference of more than 10%? I can't see your results, so I can't see what differences you're observing. If you're seeing something out of line with my results, we should be looking at mine. If you're seeing performance that agrees with my experience, I think we can probably agree that a 10% performance penalty in exchange for not having to do explicit management is a worthwhile exchange in a non-negligible set of use cases.

√iktor Ҡlang

unread,
Jan 7, 2014, 2:30:33 PM1/7/14
to Akka User List
On Tue, Jan 7, 2014 at 7:27 PM, Rüdiger Möller <moru...@gmail.com> wrote:
No, my point is that using currentTimeMillis to obtain durations _at all_ is to be considered bad practice due to the shoddy accuracy.

System.currentTimeMillis() is wall-clock time. When measuring durations > 500 ms, accuracy issues are not a problem. Modern OSes + VMs have better accuracy than older ones. I'll change to nanos just in case.

Yeah, so if you have clock drift, or someone changes the wall clock between invocations, you're hosed :)
 
 
 

You're comparing apples to oranges, i.e. a transport with a model of computation.
Akka's remoting transport is pluggable, so you could implement a UM version of it if you so wish. Or even that WLLM!

Ok, wasn't aware of that pluggability feature. Good.
 
Reference?

From the post above (Intel Xeon, 2 sockets x 6 cores) ..

========================================== 1m jobs each perform 100-pi-slice loop
AKKA
average 1 threads : 1914
Abstraktor
average 1 threads : 1349
synced Threading
average 1 threads : 800

"Sync'ed threading" schedules runnables to an executor, which obviously is fastest.
The Abstraktor prototype pushes method calls onto a ConcurrentLinkedQueue and executes them via reflection. This already produces significant overhead.
Akka has an overhead of >2 times the single-threaded case.

This would not be a problem if they scaled infinitely (threading does not scale at all in the 1 million message case). But they don't, because the queues passing inter-thread messages create contention (to a lesser extent compared to threading). Both Abstraktor and Akka stop scaling at a certain number of CPU cores.

So if the basic overhead is too high, the break-even never comes!

I'd beg to differ. Once you hit the ceiling of scaling up, you scale out.
And in any case, contention needs to be managed if you don't tolerate message loss (which you should).
 
Even worse (I don't know the exact reason): the default Akka queues seem to produce more contention than my prototype-ish plain polled CLQ. So Akka comes with the highest dispatch overhead and scales out worst due to contention: double fail. You should do something about that. Fast message passing is at the core of the system; it's not a good idea to relax regarding efficiency in such a critical part of your system.

You're basing this on unoptimized settings, please don't.
 

 
That's what you need for high-end throughput+failover IMO. Doing typed-actor message passing via JDK proxies is, well .. you should know yourself :-)

Absolutely, TypedActors used to be based on AspectWerkz proxies but were repurposed to use JDK Proxies due to the use-case. You are of course, if you want, free to use a JVM that ships with extreme performance JDK Proxies. :)

400 lines of bytecode weaving can fix that.

You are absolutely free to use any weaving lib you want :-), the full Akka TypedActor impl is 1 file, 0 external dependencies and 700 lines (including whitespace and comments/documentation): https://github.com/akka/akka/blob/master/akka-actor/src/main/scala/akka/actor/TypedActor.scala

Akka TypedActors both work distributed and will benefit from any and all optimizations done to JDK proxies. Worth noting is that TypedActors are not a replacement for untyped actors: http://doc.akka.io/docs/akka/2.2.3/scala/typed-actors.html#When_to_use_Typed_Actors


 
 
I've seen differences of up to four orders of magnitude just with configuration changes. As you can imagine, I have spent quite some time tuning Akka.


OK, I just have to stop posting in order to do the test now ... :-)

Cool,

Cheers,
 
 
- rüdiger


√iktor Ҡlang

unread,
Jan 7, 2014, 2:41:41 PM1/7/14
to Akka User List
On Tue, Jan 7, 2014 at 7:59 PM, Rüdiger Möller <moru...@gmail.com> wrote:

>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over; death and delay are indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that work-stealing is OK, which is a subset of cases.

How is that work stealing? Consider N receivers in identical state responding to the same requests (multicast, so requests are not sent twice). All N receivers respond, but the requestor just takes the first response and ignores the others.

That's called "hot standby", which may be fine for some use cases but not all: consider a message called "LaunchMissiles".
 

Provably untrue; you MUST have a logical proxy since it is a distributed model. The wastefulness of said proxy is implementation-dependent, and as such you cannot make any claim about the efficiency of an unstated implementation, or in general.

I can make the claim that it is awfully slow on the only usable server Java VM on the market, on the most frequently used hardware platform :-). You can roll your own proxy implementation with reasonable effort.

See my answer to this in my recent email.
 
 

Untyped actors, on the other hand, replace message dispatch with 'instanceof' chaining, which prevents any HotSpot call optimization in the case of direct calls (both actors share the same thread => a direct method call is done instead of queuing).

Which is what you want since otherwise you're synchronous, i.e. a malicious or broken recipient can prevent progress of the sender's logic leading to extremely brittle systems. See http://blog.ometer.com/2011/07/24/callbacks-synchronous-and-asynchronous

I think an actor framework should not support synchronous callbacks at all. You don't need them.
In contradiction to the blog post above, in Abstraktor callbacks do not arrive on a different thread but put a message on the actor's queue (except when sharing a thread with the caller).

You mean enqueuing in a "thread-local queue" and executing those callbacks after the current callback is done executing? We offer something for that:
 
 

Is Akka doing direct dispatch in the case of typed actors on the same dispatcher thread (if not: thank god my bench is not covering this ;-))?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Uhh, that's a very academic point of view.

What? I'm pretty sure it's a very Real World point of view.
 
The speed difference between a direct call and a message being queued is >1000 times. One can keep the contract on actors but optimize the dispatch in case they share the same thread/dispatcher.

Blanket statement. Completely depends on what type of call (the morphicity of the callsite, what invoke-instruction), the implementation of the queue, the message being enqueued etc. The Disruptor project has shown that you can get quite extreme "speed" with "message passing".

 
As long as synchronous results are forbidden, this does not affect the functionality or behaviour of an Actor.
Yes, it *may* happen that the receiver blocks due to ill behaviour.

Which is not an appropriate solution for non-academic software, if I may say so.
 
If the same ill Actor gets messages queued, it will get a queue overflow in most cases anyway. I'd consider this a bug that needs a fix. The performance tradeoff is
massive and forces a coarse-grained actor design, which in turn creates harder-to-balance apps.

I see your reasons, for me this is a no go out of practical considerations.

If you want maximum single-threaded performance, just use normal code. No need for multithreading at all; just use one thread per logical partition of operations.

Cheers,
 

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

Rüdiger Möller

unread,
Jan 7, 2014, 3:16:18 PM
to akka...@googlegroups.com


On Tuesday, January 7, 2014 at 8:41:41 PM UTC+1, √ wrote:



On Tue, Jan 7, 2014 at 7:59 PM, Rüdiger Möller <moru...@gmail.com> wrote:

>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over, death and delay is indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that workstealing is OK which is a subset of cases.

How is that work stealing? Consider N receivers in identical state responding to the same requests (multicast, so requests are not sent twice). N receivers respond, but the requester just takes the first response and ignores the others.

That's called "hot standby", which may be fine for some use cases but not all: consider a message called "LaunchMissiles".
 

:)) I admit having had some personal experience with the "transaction of death". It gets even better in an event-sourced system where restarting involves a message replay incl. the transaction of death ..
 

 

Is Akka doing direct dispatch in case of typed actors on same dispatcher thread (if not: thanks god my bench is not covering this ;-)) ) ?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Uhh, that's a very academic point of view.


>Blanket statement. Completely depends on what type of call (the morphicity of the callsite, which invoke instruction), the implementation of the queue, the message being enqueued, etc. The Disruptor project has shown that you can get quite extreme "speed" with "message passing".

Direct (HotSpot-optimized) method dispatch from a generated proxy still dwarfs any queue-based dispatch. It's not only the queuing: the absence of inlining, the hand-crafted dispatch, the allocation of queue entries, and the cache misses due to object allocation all hurt. Direct dispatch is allocation-free.
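A rough, self-contained sketch of the two paths being argued about (all names hypothetical; the loop bodies only stand in for real actor work, so the absolute numbers mean little): the queued path allocates an envelope per message and pays the queue's costs, while the direct path is a plain call the JIT can inline:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative comparison of direct dispatch versus enqueue-then-drain
// dispatch (hypothetical names, not any framework's real code). The queued
// path allocates one envelope per message and pays for the queue's atomic
// operations; the direct path is a plain method call.
public class DispatchCost {

    static final class Envelope {
        final int payload;
        Envelope(int payload) { this.payload = payload; }
    }

    // Stands in for a direct actor-to-actor call.
    public static long direct(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    // Stands in for send-via-mailbox: allocate, enqueue, later dequeue.
    public static long queued(int n) {
        ConcurrentLinkedQueue<Envelope> mailbox = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < n; i++) mailbox.offer(new Envelope(i)); // allocation per send
        long sum = 0;
        Envelope e;
        while ((e = mailbox.poll()) != null) sum += e.payload;
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long t0 = System.nanoTime();
        long a = direct(n);
        long t1 = System.nanoTime();
        long b = queued(n);
        long t2 = System.nanoTime();
        System.out.printf("direct: %d ms, queued: %d ms, equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

This deliberately ignores fairness, sender tracking, and cross-thread handoff; it only illustrates where the per-message allocation and queue overhead come from.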
 
As long synchronous results are forbidden, this does not affect functionality or behaviour of an Actor.
Yes, it *may* happen the receiver blocks due to ill behaviour.

> Which is not an appropriate solution for non-academic software, if I may say so.

I'd consider it a bug which should be fixed pre-production. There are classes of errors which cannot and should not get "repaired" at runtime, at least not with such a high price.
 
If the same ill Actor gets messages queued, it will get a queue overflow in most cases anyway. I'd consider this a bug that needs a fix. The performance tradeoff is
massive and forces coarse grained actor design, which on the other hand creates harder-to-balance apps.

I see your reasons, for me this is a no go out of practical considerations.

> If you want maximum single threaded performance, just use normal code. No need for multithreading at all, just use one thread per logical partition of operations.

Valid point. The downside is that one needs to decide at programming time which work is done single-threaded. With a "direct dispatch" option, one may use a more fine-grained actor design and later
move some of the "local" actors to other dispatchers (statically by config, or dynamically) if needed. Dynamic load balancing also becomes applicable, e.g. "splitting" an overloaded dispatcher thread into two. With "always queue" actors, the price for this "maybe" split is paid even if it turns out your heap of actors consumes only 30% of a thread.

BTW: with new config results look much better :)

- ruediger

Rüdiger Möller

unread,
Jan 7, 2014, 3:58:52 PM
to akka...@googlegroups.com
That looks better; Akka scales significantly better than the multi-threaded code. The abstraktor numbers should be considered a prototype; it certainly omits things a production-grade actor implementation has to do.
Anyway, there still seems to be contention (two threads frequently accessing the same memory), probably in one of Akka's queue implementations. You are probably aware of Nitsan Wakart's work? He offers some really well-performing queues with very low contention; google his blog or check his code on GitHub ("jaq-in-a-box").
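For reference, the low-contention mailbox queues mentioned here are typically variations of the classic non-intrusive MPSC (multi-producer, single-consumer) linked queue popularized by Dmitry Vyukov. The following is an illustrative, self-contained sketch, not Akka's or Nitsan's actual code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Vyukov-style non-intrusive MPSC queue (illustrative only).
// Producers contend on exactly one atomic swap of the head reference; the
// single consumer walks the linked nodes without any CAS, which is what
// keeps contention low compared to a general-purpose concurrent queue.
public class MpscQueue<T> {

    private static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    // head: last node enqueued (producers swap it); tail: consumer's cursor.
    private final AtomicReference<Node<T>> head;
    private Node<T> tail;

    public MpscQueue() {
        Node<T> stub = new Node<>(null);   // sentinel node
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    // Safe to call from many threads: one atomic swap, one plain store.
    public void offer(T value) {
        Node<T> node = new Node<>(value);
        Node<T> prev = head.getAndSet(node);
        prev.next = node;   // links the node; consumer may see null briefly
    }

    // Must only ever be called from the single consumer thread.
    public T poll() {
        Node<T> next = tail.next;
        if (next == null) return null;     // empty (or a producer mid-link)
        tail = next;
        return next.value;
    }

    public static void main(String[] args) {
        MpscQueue<Integer> q = new MpscQueue<>();
        for (int i = 1; i <= 3; i++) q.offer(i);
        System.out.println(q.poll() + " " + q.poll() + " " + q.poll());
    }
}
```

The single-writer consumer side is plain field access; only producers touch the AtomicReference, so cache-line ping-pong is limited to the enqueue end.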

Results (Opteron dual socket X 8core 16T): http://imgur.com/p5Vskyu
The Intel boxes are in the office, so I can't test on Intel for now.

regards + happy hakking,
Rüdiger

√iktor Ҡlang

unread,
Jan 7, 2014, 4:32:38 PM
to Akka User List
On Tue, Jan 7, 2014 at 9:16 PM, Rüdiger Möller <moru...@gmail.com> wrote:


On Tuesday, January 7, 2014 at 8:41:41 PM UTC+1, √ wrote:



On Tue, Jan 7, 2014 at 7:59 PM, Rüdiger Möller <moru...@gmail.com> wrote:

>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over, death and delay is indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that workstealing is OK which is a subset of cases.

How is that work stealing. Consider N receivers in identical state responding to the same requests (multicast, so requests are not sent twice). N receiver respond, but the requestor just takes the first response and ignores the other responses.

That's called "hot standby"—which may be fine for some use cases but not all: consider a messaged called "LaunchMissiles".
 

:)) i admit having had some personal experience with the "transaction of death". Gets even better in an event sourced system where restarting involves a message replay incl. the transaction of death ..
 

 

Is Akka doing direct dispatch in case of typed actors on same dispatcher thread (if not: thanks god my bench is not covering this ;-)) ) ?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Uhh, that's a very academic point of view.


>Blanket statement. Completely depends on what type of call (the morphicity of the callsite, what invoke-instruction), the implementation of the queue, the message being enqueued etc. The Disruptor project has shown that >you can get quite extreme "speed" with "message passing".

Direct (hotspotified) method dispatch from a generated proxy still dwarfes any queue-based dispatch. Its not only the queuing, but the absense of inlining, handcraftet dispatch, allocation of queue entries, cache misses due to object allocation which hurts. Direct dispatch is allocation free.

Yes, which is why with actors, you choose if you want to do the work using method calls within the receive or if you want to delegate it using messages.
Best of both worlds!
 
 
As long synchronous results are forbidden, this does not affect functionality or behaviour of an Actor.
Yes, it *may* happen the receiver blocks due to ill behaviour.

> Which is not an appropriate solution for non-academic software, if I may say so.

I'd consider it a bug which should be fixed pre-production. There are classes of errors which cannot and should not get "repaired" at runtime, at least not with such a high price.

You're assuming that all applications are deployed in one go. In a non-academic setting one would have to perform rolling upgrades and not all permutations of behavior can be divined.
 
 
If the same ill Actor gets messages queued, it will get a queue overflow in most cases anyway. I'd consider this a bug that needs a fix. The performance tradeoff is
massive and forces coarse grained actor design, which on the other hand creates harder-to-balance apps.

I see your reasons, for me this is a no go out of practical considerations.

> If you want maximum single threaded performance, just use normal code. No need for multithreading at all, just use one thread per logical partition of operations.

Valid point. Downside is, one needs to decide at programming time which work is done single threaded. If one has the "direct dispatch" option, one may do a more fine grained actor design, later on
 move some of the "local" actors to other dispatchers (statically by config or dynamically) in case. Additionally dynamic load balancing is applicable e.g. just do a "split" on an overloaded dispatcherthread into 2 different ones. With "always queue" actors the price for this "maybe" split is there even if it turns out your heap of actors consumes 30% of a thread only.

You can always remove features to gain more opportunity for optimization. For instance, Akka tracks message senders, which requires an envelope (payload + sender); if we didn't, we could just enqueue the payload. If one uses a preallocated payload and a preallocated queue, one would have 0 allocations for in-VM message passing.
 

BTW: with new config results look much better :)

I sort of guessed that ;)

Cheers,
 

- ruediger


√iktor Ҡlang

unread,
Jan 7, 2014, 4:33:03 PM
to Akka User List
What config did you run with?

Cheers,



Justin du coeur

unread,
Jan 7, 2014, 4:49:11 PM
to akka...@googlegroups.com
Coming into this from the outside (as an Architect who is building serious systems with Akka, not a member of the team), this struck me as a Big Red Flag:

On Tue, Jan 7, 2014 at 3:16 PM, Rüdiger Möller <moru...@gmail.com> wrote:
As long synchronous results are forbidden, this does not affect functionality or behaviour of an Actor.
Yes, it *may* happen the receiver blocks due to ill behaviour.

> Which is not an appropriate solution for non-academic software, if I may say so.

I'd consider it a bug which should be fixed pre-production. There are classes of errors which cannot and should not get "repaired" at runtime, at least not with such a high price.

I kind of wonder if you're missing the *point* of Akka.  Seriously -- read more deeply into the system, and especially the "let it fail" mentality.  It sounds to me like you're laser-focused on speed, and missing the point that robustness in the face of errors is a much higher priority in the Akka ecosystem.  There's a pretty deeply-baked philosophical viewpoint that real code *always* has bugs, and the highest priority is to put clear bounds on how much those errors can cascade.

Speaking as a consumer of the system, I honestly find the benchmarks kind of irrelevant.  I mean, saying that Akka isn't as fast as a hand-rolled system is simply stating the obvious: it's a big, complex and fairly mature framework, and that *always* comes at a price, since it has to trade off competing priorities.  Frankly, I'm pleasantly surprised that the folks in the team can get speeds that are so *close* to your hand-rolled, given that Akka's trying to do a great deal more and has never been portrayed as the fastest thing on earth.

Can it be optimized further?  Wouldn't surprise me, and I'm sure that the team is open to practical suggestions.  But it's plenty fast *enough* for nearly all practical purposes, scalable to larger farms than I'm ever likely to need, and most importantly provides a really deep, robust and easy-to-use toolset for me to build upon.  That matters a great deal more than raw speed in most situations...

Rajiv Kurian

unread,
Jan 8, 2014, 3:28:46 AM
to akka...@googlegroups.com


On Tuesday, January 7, 2014 12:16:18 PM UTC-8, Rüdiger Möller wrote:


On Tuesday, January 7, 2014 at 8:41:41 PM UTC+1, √ wrote:



On Tue, Jan 7, 2014 at 7:59 PM, Rüdiger Möller <moru...@gmail.com> wrote:

>> How is this failover and not "competing consumers"? (i.e. you have to notice someone is down before failing over, death and delay is indistinguishable in distributed systems)
>>  
>
>
> Maybe I named it wrong, however one can do delayless failover this way. It doesn't affect any client as the second instance keeps responding, so there is no delay in processing. 

But it requires that workstealing is OK which is a subset of cases.

How is that work stealing. Consider N receivers in identical state responding to the same requests (multicast, so requests are not sent twice). N receiver respond, but the requestor just takes the first response and ignores the other responses.

That's called "hot standby"—which may be fine for some use cases but not all: consider a messaged called "LaunchMissiles".
 

:)) i admit having had some personal experience with the "transaction of death". Gets even better in an event sourced system where restarting involves a message replay incl. the transaction of death ..
 

 

Is Akka doing direct dispatch in case of typed actors on same dispatcher thread (if not: thanks god my bench is not covering this ;-)) ) ?

No, it doesn't, for the reasons mentioned above. Any distributed model based on synchrony seems like a bad idea.

Uhh, that's a very academic point of view.


>Blanket statement. Completely depends on what type of call (the morphicity of the callsite, what invoke-instruction), the implementation of the queue, the message being enqueued etc. The Disruptor project has shown that >you can get quite extreme "speed" with "message passing".

Direct (hotspotified) method dispatch from a generated proxy still dwarfes any queue-based dispatch. Its not only the queuing, but the absense of inlining, handcraftet dispatch, allocation of queue entries, cache misses due to object allocation which hurts. Direct dispatch is allocation free.
Direct dispatch is fraught with problems. What happens if an actor messages itself? It sees the partial results of a receive invocation, because there was a second invocation before the first one was complete. Even if an actor doesn't message itself, cyclic communication patterns cause the same problem with direct dispatch: if Actor A messages Actor B, we call Actor B's receive function; now if Actor B messages Actor A in response, we call Actor A's receive function while it is still processing the previous message, i.e. the actor contract is violated again. I encountered this in my prototypes. It seemed like a great idea in the beginning. Maybe you are talking about something else, though. An enqueue on a thread-local queue for every send to an actor on the same dispatcher, followed by serial invocation of callbacks until the queue is empty, worked out fine ... for now. I don't know what hidden problems lie with that approach either.
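The cycle described here can be reproduced in a few lines. The following self-contained sketch (all names hypothetical) shows a naive direct dispatcher re-entering receive(), and a minimal thread-local "trampoline" queue avoiding it:

```java
import java.util.ArrayDeque;

// Self-contained sketch (hypothetical names) of the hazard described above:
// with naive direct dispatch a cyclic send re-enters receive() while it is
// still running; a minimal per-dispatcher "trampoline" queue defers nested
// sends until the current receive() has returned.
public class Reentrancy {

    public interface Dispatcher { void send(Actor to, String msg); }

    public interface Actor { void receive(String msg, Dispatcher d); }

    // Naive: send == direct call, so cycles re-enter receive().
    public static final class DirectDispatcher implements Dispatcher {
        public void send(Actor to, String msg) { to.receive(msg, this); }
    }

    // Trampoline: while a receive() is on the stack, nested sends are only
    // queued; they run after the current receive() returns.
    public static final class TrampolineDispatcher implements Dispatcher {
        private final ArrayDeque<Runnable> queue = new ArrayDeque<>();
        private boolean running = false;

        public void send(Actor to, String msg) {
            queue.add(() -> to.receive(msg, this));
            if (running) return;              // defer: a receive() is running
            running = true;
            Runnable r;
            while ((r = queue.poll()) != null) r.run();
            running = false;
        }
    }

    // Sends itself one message and records whether receive() was re-entered.
    public static final class SelfPinger implements Actor {
        int depth = 0;
        boolean reentered = false;
        boolean sent = false;

        public void receive(String msg, Dispatcher d) {
            depth++;
            if (depth > 1) reentered = true;  // actor contract violated
            if (!sent) { sent = true; d.send(this, "pong"); }
            depth--;
        }
    }

    public static boolean reenters(Dispatcher d) {
        SelfPinger a = new SelfPinger();
        d.send(a, "ping");
        return a.reentered;
    }

    public static void main(String[] args) {
        System.out.println("direct: " + reenters(new DirectDispatcher()));
        System.out.println("trampoline: " + reenters(new TrampolineDispatcher()));
    }
}
```

The `running` flag plays the role of the invocation counter discussed further down: it detects that a receive() is already on the stack and turns the nested direct call into an enqueue.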

Rüdiger Möller

unread,
Jan 8, 2014, 8:45:53 AM
to akka...@googlegroups.com
Good finding .. I wasn't aware of this :-). As pointed out in another forum, it should be possible to avoid it by keeping invocation counters on an actor. I have to figure out the details, but something along the lines of counting in-/outgoing calls should work to avoid the problem you described.

Rüdiger Möller

unread,
Jan 8, 2014, 8:52:52 AM
to akka...@googlegroups.com
I am "laser focused" on performance because we have a very performance-critical project. So this is specific to my use case, and I agree with you that it is not an everyday requirement; many other applications have different needs and weigh other features as more important. However, for me it is important to get a rough feeling for how big a framework's tradeoff is (compared to a hand-rolled, special-purpose implementation).
And I think (after I changed the config) Akka is doing well here, no doubt. The 1-million-tiny-jobs test is kind of a smoke test crying out for contention effects.

regards,
Rüdiger

Rüdiger Möller

unread,
Jan 10, 2014, 4:58:30 AM
to akka...@googlegroups.com
Just for the record:

a) I had cut too many corners in my reference prototype impl.
b) on Xeon it looks quite different compared to Opteron

Results with corrections on 
  • 2-socket Opteron, 8 cores / 16 threads per socket (16 real cores overall) @ 2.1 GHz
  • 2-socket Xeon, 6 cores / 12 threads per socket (12 real cores overall) @ 3 GHz
Resulting charts:

regards,
Rüdiger

Roland Kuhn

unread,
Jan 10, 2014, 5:05:49 AM
to akka-user
That looks like you are using ThreadPoolExecutor, but instead of guessing it would be much nicer if you could just publish the complete config matching these plots.

BTW: do I read that correctly that on Xeon Akka (with its full semantics) scales exactly as well as your “cut corners” prototype? ;-) (which would not surprise me at all … )

Thanks,

Roland




Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn


Alec Zorab

unread,
Jan 10, 2014, 5:26:19 AM
to akka...@googlegroups.com
For those of us who can't get onto imgur, would it be possible to either host them elsewhere or just attach the charts to a mail?

√iktor Ҡlang

unread,
Jan 10, 2014, 5:52:33 AM
to Akka User List
Would be great to see your config.

Also, considering Xeon vs Opteron, there was a pretty nice bloke who I believe stated something along the lines of:

"Also, Opterons have pretty bad cache performance for inter-core comms (Intel uses inclusive L3s for faster on-package caches)"

Cheers,



Rüdiger Möller

unread,
Jan 10, 2014, 5:58:23 AM
to akka...@googlegroups.com


On Friday, January 10, 2014 at 11:05:49 AM UTC+1, rkuhn wrote:
That looks like you are using ThreadPoolExecutor, but instead of guessing it would be much nicer if you could just publish the complete config matching these plots.

I use the config someone posted above

// Create an Akka system
        ActorSystem system = ActorSystem.create("PiSystem", ConfigFactory.parseString(
                "akka {\n" +
                        "  actor.default-dispatcher {\n" +
                        "      fork-join-executor {\n" +
                        "        parallelism-min = 2\n" +
                        "        parallelism-factor = 0.4\n" +
                        "        parallelism-max = 16\n" +
                        "      }\n" +
                        "      throughput = 1000\n" +
                        "  }\n" +
                        "\n" +
                        "  log-dead-letters = off\n" +
                        "\n" +
                        "  actor.default-mailbox {\n" +
                        "    mailbox-type = \"akka.dispatch.SingleConsumerOnlyUnboundedMailbox\"\n" +
                        "  }\n" +
                        "}"
        )
        );

 

BTW: do I read that correctly that on Xeon Akka (with its full semantics) scales exactly as well as your “cut corners” prototype? ;-) (which would not surprise me at all … )


Yes, however the "cut corners" were not that massive in impact. CPU architecture matters (surprise). Don't get obsessed with the prototype; you know as well as I do that we live in a marketing-driven world, so in order to evaluate products I always create a quick prototype to have a baseline to compare against. It frequently happens that hyped products turn out to be pretty lame (e.g. 5 times slower). I then recheck whether I omitted something important in the prototype and do a quick profile of the product. And in many cases I find some "Character c ..; out.write( c.toString().toByteArray() )", which means "stop evaluating" ;-).

I'd say Akka is pretty close to what's possible (general purpose) on the JVM. I'd suspect <5-10% potential for application-specific optimization. That's good! I am still a bit puzzled regarding proxy dispatch of typed actors (you know: programmers want code completion ;-) ), but that's just my gut ..

- ruediger

Rüdiger Möller

unread,
Jan 10, 2014, 6:08:15 AM
to akka...@googlegroups.com
Yes, I had that in mind of course, but at home I only have AMD boxes (+1 i7 box), just because I'm a contrarian ;). That's why I rechecked on an Intel box at the office.
For testing/implementing concurrent applications Opterons are not that bad, being contention detectors. On a side note, the best overall absolute throughput could be observed on the Opteron machine (though it is much cheaper). It would be interesting to also run the test on my FX8350 @ 4 GHz instead of the 2.1 GHz Opteron, but no time for now ..
Unfortunately the test uses floating point a lot, so I cannot check how the int-only cores of the AMD machines behave compared to Hyper-Threading in reality.

√iktor Ҡlang

unread,
Jan 10, 2014, 6:09:42 AM
to Akka User List
On Fri, Jan 10, 2014 at 11:58 AM, Rüdiger Möller <moru...@gmail.com> wrote:


On Friday, January 10, 2014 at 11:05:49 AM UTC+1, rkuhn wrote:
That looks like you are using ThreadPoolExecutor, but instead of guessing it would be much nicer if you could just publish the complete config matching these plots.

I use the config someone posted above

// Create an Akka system
        ActorSystem system = ActorSystem.create("PiSystem", ConfigFactory.parseString(
                "akka {\n" +
                        "  actor.default-dispatcher {\n" +
                        "      fork-join-executor {\n" +
                        "        parallelism-min = 2\n" +
                        "        parallelism-factor = 0.4\n" +


Cool, I recommend tuning the parallelism-factor between 0.3 and 1.0 to find the optimum.

Rüdiger Möller

unread,
Jan 10, 2014, 6:10:50 AM
to akka...@googlegroups.com
http://i.imgur.com/0RnbRsX.png


2014/1/10 Alec Zorab <alec...@gmail.com>

Roland Kuhn

unread,
Jan 10, 2014, 6:39:53 AM1/10/14
to akka-user
On 10 Jan 2014, at 12:09, √iktor Ҡlang <viktor...@gmail.com> wrote:




On Fri, Jan 10, 2014 at 11:58 AM, Rüdiger Möller <moru...@gmail.com> wrote:


On Friday, January 10, 2014 at 11:05:49 AM UTC+1, rkuhn wrote:
That looks like you are using ThreadPoolExecutor, but instead of guessing it would be much nicer if you could just publish the complete config matching these plots.

I use the config someone posted above

// Create an Akka system
        ActorSystem system = ActorSystem.create("PiSystem", ConfigFactory.parseString(
                "akka {\n" +
                        "  actor.default-dispatcher {\n" +
                        "      fork-join-executor {\n" +
                        "        parallelism-min = 2\n" +
                        "        parallelism-factor = 0.4\n" +


Cool, I recommend you to tune the parellelism-factor between 0.3 and 1.0 to find the optimum.

I should add that having more threads than (active) actors will be detrimental to performance in most cases due to thread hopping (caused by aggressive work stealing). ForkJoinPool works best if all its threads have something to do, in which case the thread-local submission queues become effective and cache locality is improved. This means that for the low parallelism data points you will find opportunity for performance gains by reducing the number of threads—this learning process is exhibited by every external benchmark I have seen so far.
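Expressed as a dispatcher config fragment in the style of the one quoted above (the values here are illustrative, not a recommendation), Roland's advice amounts to capping the pool size below the core count for low-parallelism runs:

```hocon
akka.actor.default-dispatcher {
  fork-join-executor {
    # threads = ceil(available processors * parallelism-factor),
    # clamped to [parallelism-min, parallelism-max]
    parallelism-min    = 2
    parallelism-factor = 0.5
    # keep the pool no larger than the number of actually busy actors,
    # to limit thread hopping from aggressive work stealing
    parallelism-max    = 8
  }
  throughput = 1000
}
```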

Of course you can always screw up worse, e.g. by setting parallelism-factor=100 … (sorry for the tangential side-rant).

Regards,

Roland