A Fork-Join Calamity


Martin Thompson

Apr 13, 2014, 3:43:31 AM
The following article proposes that the FJ framework and parallel collections could be a calamity for Java.

http://coopsoft.com/ar/Calamity2Article.html

I've been uncomfortable with FJ for some time. I see people struggling to design for it, struggling to debug it, and more often than not failing to get performance benefits. Other approaches, such as pipelining tasks, can often be far more effective and easier to reason about.

I also find it amusing that, after years of trying to hide databases behind ORMs, to use the parallel collections effectively you need to understand set theory to write good queries.

The following blog shows just how easily bloggers can misuse parallel collections by having no sympathy for the CPU resources on a system. I think this is only the tip of the iceberg.

http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/

I'm curious to know if others have doubts about either Fork-Join or parallel collections, or if these are really good ideas and somehow the penny has not dropped for me? I'd really like to see a good evidence based debate on this subject.

Regards,
Martin...


Peter Lawrey

Apr 13, 2014, 4:09:48 AM
to mechanica...@googlegroups.com

I agree with everything you have said. Even for simpler frameworks, there are still a surprising number of ways to misuse them. To some degree this was Java's brilliance: a minimum of features, which minimises edge cases. To be fair, my own libraries are *really* bad in this regard. ;)

I recently had cause to migrate some C# code to Java and have seen some cool uses of closures but also some really dire ones.

    List<String> list = new ArrayList<>();
    List<String> listA = new ArrayList<>();
    List<String> listB = new ArrayList<>();
    list.stream().forEach(p -> listA.add(p)); // listA.addAll(list) would do
    list.stream().forEach(p -> listB.add(p)); // likewise listB.addAll(list)

One of the big things is that stream() methods like sum() don't work on BigDecimal or BigInteger. Why is using BigDecimal in Java so painful? :(

On 13 Apr 2014 08:38, "Martin Thompson" <mjp...@gmail.com> wrote:
The following article proposes that the FJ framework and parallel collections could be a calamity for Java. 


I've been uncomfortable with FJ for some time. I see people struggling to design for it, debug it, and more often fail to get performance benefits. Other approaches such as pipelining tasks can often be way more effective and easier to reason about.

I also find it amusing that, after years of trying to hide databases behind ORMs, to use the parallel collections effectively you need to understand set theory for writing good queries.

The following blog shows just how bloggers can so easily misuse parallel collections by having no sympathy for CPU resource on a system. I think this is only the tip of the iceberg.


I'm curious to know if others have doubts about either Fork-Join or parallel collections, or if these are really good ideas and somehow the penny has not dropped for me? I'd really like to see a good evidence based debate on this subject.

Regards,
Martin...


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kirk Pepperdine

Apr 13, 2014, 4:37:27 AM
to mechanica...@googlegroups.com
I would have to say that every highly scalable system I've been involved with has employed pipelining. FJ is interesting but IMHO its use cases are limited and questionable. Unfortunately I fear that FJ nepotism has influenced Java's implementation of Lambdas. I smell a "Spring"-like opportunity here.

Regards,
Kirk

Remi Forax

Apr 13, 2014, 8:01:38 AM
to mechanica...@googlegroups.com
On 04/13/2014 09:38 AM, Martin Thompson wrote:
> The following article proposes that the FJ framework and parallel
> collections could be a calamity for Java.
>
> http://coopsoft.com/ar/Calamity2Article.html
>
> I've been uncomfortable with FJ for some time. I see people struggling
> to design for it, debug it, and more often fail to get performance
> benefits. Other approaches such as pipelining tasks can often be way
> more effective and easier to reason about.
>
> I also find it amusing that after years of trying to hide databases
> behind ORMs that to use the parallel collections effectively you need
> to understand set theory for writing good queries.
>
> The following blog shows just how bloggers can so easily misuse
> parallel collections by having no sympathy for CPU resource on a
> system. I think this is only the tip of the iceberg.
>
> http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/
> <http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/>
>
> I'm curious to know if others have doubts about either Fork-Join or
> parallel collections, or if these are really good ideas and somehow
> the penny has not dropped for me? I'd really like to see a good
> evidence based debate on this subject.
>
> Regards,
> Martin...

Like any worker pool, the default fork/join pool (ForkJoinPool.commonPool())
used by the parallel Stream API has to be configured globally for the
whole application. The default configuration considers that all cores are
available, which is obviously wrong if you have a server.
So what?

Rémi


Remi Forax

Apr 13, 2014, 8:20:10 AM
to mechanica...@googlegroups.com
On 04/13/2014 10:37 AM, Kirk Pepperdine wrote:
> I would have to say that every highly scalable system I’ve been
> involved with has employed pipe-lining. FJ is interesting but IMHO
> it’s use cases are limited and questionable. Unfortunately I fear that
> FJ nepotism has influenced Java’s implementation of Lambda’s.

Hi Kirk,
You're right that the implementation of the parallel part of the Stream
API is fully FJ based.
You can use the word "nepotism" to describe this fact; I prefer to talk
about a shortage of skilled people willing to help.

> I smell a “Spring” like opportunity here.

Having an SPI mechanism for the Stream API is now scheduled for Java 9
(that's why StreamSupport exists).

>
> Regards,
> Kirk

cheers,
Rémi


Kirk Pepperdine

Apr 13, 2014, 8:40:49 AM
to mechanica...@googlegroups.com
Hi Remi,

>
> Hi Kirk,
> You're right that the implementation of the parallel part of the Stream API is fully FJ based.
> You can use the word "nepotism" to describe this fact, I prefer to talk about short of skilled people willing to help.

There are certainly cases where the community would have liked to have contributed but conditions on the ground (i.e. how Oracle treats the community) make it difficult. There are way too many behind-closed-doors conversations for OpenJDK to be… well.. open.
As for the lack of skilled people… the JVM logging JEP is a fine example of that.. the JEP defenders know nothing about messaging… I know because I spoke with them. They are ill-equipped to implement, let alone propose, a logging API… Our own Peter Lawrey has Java Chronicle, which could be used as a model to support that JEP, but I fear a NIH attitude would prevent that from happening… better to copy the worst example of the logging frameworks that Java developers have been forced to live with for years. So yeah, there's a bit of despair in the air about Oracle's inability to interact with a real community.. ;-)

First and foremost I really don't want to cast a negative air over the great effort that Brian and his minions have put forth to get Lambdas out the door.... that said, when you get past the shock-and-awe demos that have been making the conference circuit and try to convert that to something in the large…. you get a sense of the limitation that FJ has placed on Lambdas. Initially Lambdas make Java feel more like what OO programming was supposed to be like.. then you immediately feel the limitations in the streams and filters. They are meant to work for about one use case, and suggesting I might want to use them in another way didn't quite get the response I'd anticipated..

>
>> I smell a "Spring" like opportunity here.
>
> Having a SPI mechanism for the Stream API is now scheduled for Java 9 (that's why StreamSupport exists).

A Streams SPI will be appreciated!!!

Regards,
Kirk

Martin Thompson

Apr 13, 2014, 8:49:27 AM
to mechanica...@googlegroups.com

Like any worker pool, the default fork/join pool (ForkJoinPool.commonPool())
used by the parallel Stream API has to be configured globally for the
whole application. The default configuration considers that all cores are
available, which is obviously wrong if you have a server.
So what?

The streaming API is aimed at general use from what I've heard on the conference circuit of late. That means it is sharing a machine with lots of other threads involved in "general use" within applications, e.g. a web container with many threads in its pool. If the default is to assume exclusive access to the system resources then I'd say that is somewhat naive. The same can be said for any component/framework that starts its own "inconsiderate" thread pool. For map-reduce on a system like Hadoop people assume exclusive access to system resources. I'd argue that streams are not the same. If they are to be used in general programming and co-exist with larger solutions then there is something significant missing.
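
To make the global nature of that knob concrete, here is a minimal sketch (my own illustration, not code from the thread) of the only supported way to size the common pool: a JVM-wide system property that must be set before the pool is first touched by anything.

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolSize {
    public static void main(String[] args) {
        // JVM-wide knob: affects every parallel stream in the process,
        // and only takes effect if set before the common pool is first used.
        System.setProperty(
                "java.util.concurrent.ForkJoinPool.common.parallelism", "2");

        // In a fresh JVM this prints 2: all parallel streams now share
        // these two worker threads, whatever else the application is doing.
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}
```

There is no per-pipeline override in the public API, which is exactly the "inconsiderate thread pool" concern: one library's parallel stream competes with every other consumer of the common pool.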

Let's take a step back and consider usability. It does not matter what any of our own personal views are. If any technology causes users to make usability mistakes then it has low affordance. We have somewhat learned our lessons here in UX design but we seem to be in the dark ages about usability when it comes to lower level components.

Remi Forax

Apr 13, 2014, 9:24:59 AM
to mechanica...@googlegroups.com
On 04/13/2014 02:40 PM, Kirk Pepperdine wrote:
> Hi Remi,
>
>> Hi Kirk,
>> You're right that the implementation of the parallel part of the Stream API is fully FJ based.
>> You can use the word "nepotism" to describe this fact, I prefer to talk about short of skilled people willing to help.
> There are certainly cases where the community would have liked to have contributed but conditions on the ground (ie how Oracle treats community) make it difficult. There are way too many behind closed door conversations for OpenJDK to be... well.. open.

It doesn't match my experience. I think that things improved.
By example, the mailing list for invokedynamic was private while the one
for lambdas and streams are public.

> As for lack of skilled people... the JVM logging JEP is a fine example of that.. the JEP defenders know nothing about messaging... I know because I spoke with them. They are ill equipped to implement, let alone propose a logging API... Our own Peter Lawrey has Java Chronicle which could be used as a model to support that JEP but I fear a NIH attitude would prevent that from happening... better to copy the worst example of logging frameworks that java developers have been forced to live with for years. So yeah, theres a bit of despair in the air about Oracle's inability to interact with a real community.. ;-)

Yes, the current JEP process is mostly one-way. Ways to improve the
process were discussed in the open, I think at the latest FOSDEM, and Mark
Reinhold recently proposed a new JEP process.
http://cr.openjdk.java.net/~mr/jep/jep-2.0.html

>
> First and foremost I really don't want to cast a negative air over the great effort that Brian and his minions have put forth to get Lambda's out the door....

:)

> that said, when you get past the shock and awe demo's that have been making the conference circuit and try to convert that to something in the large.... you get a sense of the limitation that FJ has placed on Lambdas. Initially Lambda's make Java feel more like what OO programming was suppose to be like.. then you immediately feel the limitation in the streams and filters. They are meant to work for about 1 use case and to suggest i might want to use them in another way didn't quite result in the response I'd anticipated..

Could you be a little more specific? I have a hard time figuring out what
you are talking about.

cheers,
Rémi

Remi Forax

Apr 13, 2014, 10:24:20 AM
to mechanica...@googlegroups.com
On 04/13/2014 02:49 PM, Martin Thompson wrote:
>
> Like any worker pool, the default fork/join pool
> (ForkJoinPool.commonPool())
> used by the parallel Stream API has to be configured globally for the
> whole application. The default configuration considers that all
> cores are
> available, which is obviously wrong if you have a server.
> So what?
>
>
> The streaming API is aimed at general use from what I've heard on the
> conference circuit of late. That means it is sharing a machine with
> lots of other threads involved in "general use" within applications,
> e.g. a web container with many threads in its pool. If the default is
> to assume exclusive access to the system resources then I'd say that
> is somewhat naive. The same can be said for any component/framework
> that starts its own "inconsiderate" thread pool. For map-reduce on a
> system like Hadoop people assume exclusive access to system resources.
> I'd argue that streams are not the same. If they are to be used in
> general programming and co-exist with larger solutions then there is
> something significant missing.

This is a little different here because this is part of the JDK and not
a framework.
When web containers support Java 8, most of them will either set
the property "java.util.concurrent.ForkJoinPool.common.parallelism" or
use a specific fork/join pool instead of the default one [1].
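
As a sketch of that second option (my code, not Remi's): wrapping the terminal operation in a task submitted to your own pool causes the parallel stream's work to run on that pool rather than the common one. This is an implementation behaviour rather than a documented contract, which is why the SO question linked below exists.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class OwnPoolStream {
    public static void main(String[] args) throws Exception {
        // A bounded pool owned by the application, not the JVM-wide one.
        ForkJoinPool pool = new ForkJoinPool(2);
        try {
            long sum = pool.submit(() ->
                    // The stream's FJ tasks execute in the submitting pool.
                    LongStream.rangeClosed(1, 1_000).parallel().sum()
            ).get();
            System.out.println(sum); // 500500
        } finally {
            pool.shutdown();
        }
    }
}
```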

BTW, this was discussed in the open lambda-util mailing list,
http://cs.oswego.edu/pipermail/lambda-lib/2011-March/000122.html
and you can verify for yourself that the current implementation is no
different from what was decided:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/7eea0aa4c7ad/src/share/classes/java/util/concurrent/ForkJoinPool.java#l3300
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/7eea0aa4c7ad/src/share/classes/java/util/concurrent/ForkJoinTask.java#l691

>
> Let's take a step back and consider usability. It does not matter what
> any of our own personal views are. If any technology causes users to
> make usability mistakes then it has low affordance. We have somewhat
> learned our lessons here in UX design but we seem to be in the dark
> ages about usability when it comes to lower level components.

I agree with you that it should work out of the box and I think the
current state is mostly due to the fact that most web containers are not
yet configured to work with Java 8.

cheers,
Rémi
[1]
http://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream


Kirk Pepperdine

Apr 13, 2014, 10:26:24 AM
to mechanica...@googlegroups.com
Hi Remi,

>
> It doesn't match my experience. I think that things improved.

In certain areas, certainly.. but in a number of cases one can only sort things out after the fact.. which is often too late to effect meaningful change.. without seeming like a disruptive force…
>
> Yes, JEP current process is mostly one way, the ways to improve the process was discussed in the open, I think it was at latest FOSDEM, Mark Reinhold recently proposed a new JEP process.
> http://cr.openjdk.java.net/~mr/jep/jep-2.0.html

Right, while making an announcement @ some random conference is ok.. what about those who weren't able to be @ FOSDEM? Mark, of all people, should know that the Java community is very fragmented as opposed to OSG.. a model in which Oracle is trying to borg all of the JUGs into… But this is really getting off topic so I'll take this one off-line.
>
>>
>> First and foremost I really don't want to cast a negative air over the great effort that Brian and his minions have put forth to get Lambda's out the door....
>
> :)
>
>> that said, when you get past the shock and awe demo's that have been making the conference circuit and try to convert that to something in the large.... you get a sense of the limitation that FJ has placed on Lambdas. Initially Lambda's make Java feel more like what OO programming was suppose to be like.. then you immediately feel the limitation in the streams and filters. They are meant to work for about 1 use case and to suggest i might want to use them in another way didn't quite result in the response I'd anticipated..
>
> could you be a little more specific, I have hard time to figure out what you are talking about ?

Sorry for being so terse…. I'd like to fork streams, as one example.. or have filters to split streams into multiple streams based on some criteria, like a tap. The response to fork() was: you could run your JVM out of memory should you have a slow consumer, so we don't want to provide that feature.

To Martin's point, having the parallelism fed from a common, configurable thread pool might help. I've certainly run into situations where competing threading requirements/pools completely crushed a server.

Regards,
Kirk

Aleksey Shipilev

Apr 13, 2014, 11:29:56 AM
to mechanica...@googlegroups.com
On 04/13/2014 11:38 AM, Martin Thompson wrote:
> The following article proposes that the FJ framework and parallel
> collections could be a calamity for Java.
>
> http://coopsoft.com/ar/Calamity2Article.html

Oh, this article is still around... If you read it carefully, you will
notice that after listing all the bad things about FJP, it does not
offer a sound alternative. The "alternative" presented is to scrap the
current highly-tuned and bug-fixed implementation and start over. As if
that magically guarantees the second attempt is implicitly better.

I think that FJP is the only viable option for the divide-and-conquer
style of tasks. Of course it has shortcomings, because that's how the
world works: there are always tradeoffs. If you know some other
implementation that makes better tradeoffs, then you should totally come
forward with it and advocate its inclusion into the JDK. Or, at least,
make that claim substantial by producing sound comparative research.
Listing all the shortcomings of a particular framework is hardly
research. Oh wait, this feels like something I said before...

http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006169.html
http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006177.html

Saying one framework is bad without bringing up a viable alternative
is as constructive as a rally against the Second Law of
Thermodynamics. It sure feels like you can reverse entropy and bring
happiness to a dying Universe, but that belief quickly diminishes as you
actually try to do it, because you haven't considered *why* those
pitfalls are there.


> I've been uncomfortable with FJ for some time. I see people struggling
> to design for it, debug it, and more often fail to get performance
> benefits. Other approaches such as pipelining tasks can often be way
> more effective and easier to reason about.

I am comfortable with FJP, and I am happy seeing its use in JDK 8
Streams, because, frankly, you don't frequently see the execution
frameworks with that kind of performance magic: striped submission
queues, in-submitter execution while pool threads ramp up, false sharing
avoidance in thread queues, randomized balancing with super-fast PRNGs,
lock-free/relaxed-ops work queues, avoiding multiword-cas/locked
implementations of control words, branch prediction considerations, etc.

FJP tackles the problem of exploiting the internal parallelism without
sacrificing the external one. How successful is pipelining at those
things? I mean, surely, you can do something like Disruptor with
busy-wait handoffs, but in my mind, it is even more "non-sympathetic" to
other code than running a few additional pools full of threads.

The performance model for parallel execution is very hard and in most
cases context-dependent, and this is why JDK 8 *did not* make parallel()
implicit. You might find it a cowardly move to burden developers with
the choice of making a computation parallel or sequential, but once again,
consider the alternatives: either an implicit parallel() which blows up
unexpectedly and nothing can be done to turn it off, or no
parallel() at all.
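
To spell out that explicit opt-in (my example, not Aleksey's): the same pipeline stays sequential unless parallelism is requested, so the choice, and the responsibility, sits with the caller.

```java
import java.util.Arrays;
import java.util.List;

public class OptIn {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);

        // Sequential by default...
        int seq = xs.stream().mapToInt(Integer::intValue).sum();

        // ...parallel only on explicit request, running on the common pool.
        int par = xs.stream().parallel().mapToInt(Integer::intValue).sum();

        System.out.println(seq + " " + par); // 15 15
    }
}
```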

> I also find it amusing that after years of trying to hide databases
> behind ORMs that to use the parallel collections effectively you need to
> understand set theory for writing good queries.
>
> The following blog shows just how bloggers can so easily misuse parallel
> collections by having no sympathy for CPU resource on a system. I think
> this is only the tip of the iceberg.

>
> http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/
> <http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/>

Breaking news: not a silver bullet again! You can't actually run faster
with #Threads > #CPUs! That parallel() thing is a lie!

If you look at the benchmarks there... well... um... I would just say it
is a good exercise for seasoned benchmark guys to spot the mistakes
which make the results questionable. Anyway, if we want to *speculate*
that the experimental setup is miraculously giving us sane performance data:

* Sorting is now only 20% faster – a 23X decline.
* Filtering is now only 20% faster – a 25X decline.
* Grouping is now 15% slower.

(That 23X-25X decline is a red herring because it compares the results of
two different tests.)

Am I reading it right? You put 10 client threads submitting the same
task in the pool, and you are *still* 20% faster on parallel tests? And
that is on an 8-hardware-thread machine (which is a funny pitfall on its
own)? That means, even when external parallelism is present, you can
still enjoy the benefits of the internal one? Or is that the fever dream
of an overloaded machine?

-Aleksey.

Remi Forax

Apr 13, 2014, 11:44:43 AM
to mechanica...@googlegroups.com
On 04/13/2014 04:26 PM, Kirk Pepperdine wrote:
> Hi Remi,
>
>> It doesn't match my experience. I think that things improved.
> In certain areas certainly.. but in a number of cases one can only sort out things after the fact.. which is often too late to affect meaningful change.. without seeming like a disruptive force...
>> Yes, JEP current process is mostly one way, the ways to improve the process was discussed in the open, I think it was at latest FOSDEM, Mark Reinhold recently proposed a new JEP process.
>> http://cr.openjdk.java.net/~mr/jep/jep-2.0.html
> Right, while making an announcement @ some random conference is ok.. what about those that weren't able to be @ FOSDEM? Mark, of all people, should know the the Java community is very fragmented as apposed to OSG.. a model in which Oracle is trying to borg all of the JUGs into... But this is really getting off topic so I'll take this one off-line.

Just to be clear, Mark did not make any announcement. After Mark's
presentation on another topic, a guy asked why the JEPs have no mailing
list (or bug tracker, I can't remember), a discussion sprang up, and as
a result Mark proposed a new JEP process.


>>> First and foremost I really don't want to cast a negative air over the great effort that Brian and his minions have put forth to get Lambda's out the door....
>> :)
>>
>>> that said, when you get past the shock and awe demo's that have been making the conference circuit and try to convert that to something in the large.... you get a sense of the limitation that FJ has placed on Lambdas. Initially Lambda's make Java feel more like what OO programming was suppose to be like.. then you immediately feel the limitation in the streams and filters. They are meant to work for about 1 use case and to suggest i might want to use them in another way didn't quite result in the response I'd anticipated..
>> could you be a little more specific, I have hard time to figure out what you are talking about ?
> Sorry for being so terse.... I'd like to fork streams as one example.. or have filters to split streams into multiple streams based on some criteria, like a tap? the response to fork() was, you could run your JVM out of memory should you have a slow consumer. so we don't want to provide that feature.

You cannot split a stream into multiple streams because of the way the
API is crafted.
It works like this: you construct your pipeline using lazy operations,
and at the end
you call a terminal operation that starts the pump that takes values and
pushes them through the pipeline.
Because it is the terminal operation that starts the process, you cannot
have more than one tail.

Now, if you want to implement fork(), i.e. use another thread to
process a sub-stream
with a queue in the middle, you can use Stream.peek(), which is
equivalent to what you call a tap.
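
A minimal sketch of that idea (mine; the names are illustrative): peek() side-feeds every element into a queue while the main pipeline runs, and a second consumer could drain that queue. With an unbounded queue this is exactly the slow-consumer OOM risk mentioned above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.IntStream;

public class PeekTap {
    public static void main(String[] args) {
        // Unbounded on purpose to show the risk: a slow consumer lets it grow.
        BlockingQueue<Integer> tap = new LinkedBlockingQueue<>();

        long evens = IntStream.range(0, 10)
                .peek(tap::add)          // the "tap": copy every element out
                .filter(i -> i % 2 == 0)
                .count();                // terminal op drives the pipeline

        System.out.println(evens + " " + tap.size()); // 5 10
    }
}
```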

>
> To Martin's point, having the parallelism feed from a common configurable thread pool might help. I've certainly run into situations where competing threading requirements/pools completely crushed a server.

see my answer to Martin.

>
> Regards,
> Kirk
>

cheers,
Rémi

Kirk Pepperdine

Apr 13, 2014, 11:46:58 AM
to mechanica...@googlegroups.com

On Apr 13, 2014, at 5:44 PM, Remi Forax <fo...@univ-mlv.fr> wrote:

> On 04/13/2014 04:26 PM, Kirk Pepperdine wrote:
>> Hi Remi,
>>
>>> It doesn't match my experience. I think that things improved.
>> In certain areas certainly.. but in a number of cases one can only sort out things after the fact.. which is often too late to affect meaningful change.. without seeming like a disruptive force...
>>> Yes, JEP current process is mostly one way, the ways to improve the process was discussed in the open, I think it was at latest FOSDEM, Mark Reinhold recently proposed a new JEP process.
>>> http://cr.openjdk.java.net/~mr/jep/jep-2.0.html
>> Right, while making an announcement @ some random conference is ok.. what about those that weren't able to be @ FOSDEM? Mark, of all people, should know the the Java community is very fragmented as apposed to OSG.. a model in which Oracle is trying to borg all of the JUGs into... But this is really getting off topic so I'll take this one off-line.
>
> Just to be clear, Mark did not do any announcement, after Mark's presentation on another topic, a guy ask why the JEPs have no mailing list (or bug tracking, I can't remember), a discussion spring up and as a result Mark proposes a new JEP process.

I think you’ve made my point… thanks ;-)

>
>
>>>> First and foremost I really don't want to cast a negative air over the great effort that Brian and his minions have put forth to get Lambda's out the door....
>>> :)
>>>
>>>> that said, when you get past the shock and awe demo's that have been making the conference circuit and try to convert that to something in the large.... you get a sense of the limitation that FJ has placed on Lambdas. Initially Lambda's make Java feel more like what OO programming was suppose to be like.. then you immediately feel the limitation in the streams and filters. They are meant to work for about 1 use case and to suggest i might want to use them in another way didn't quite result in the response I'd anticipated..
>>> could you be a little more specific, I have hard time to figure out what you are talking about ?
>> Sorry for being so terse.... I'd like to fork streams as one example.. or have filters to split streams into multiple streams based on some criteria, like a tap? the response to fork() was, you could run your JVM out of memory should you have a slow consumer. so we don't want to provide that feature.
>
> You can not split a stream into multiple streams because of the way the API is crafted,

Again, you’ve made my point… and thanks again ;-)

Regards,
Kirk

Martin Thompson

Apr 13, 2014, 12:02:54 PM
to mechanica...@googlegroups.com
I'm trying to stimulate a healthy debate and increase the understanding in our community. My primary goal is to see software developed that delivers real value for a business.

Divide-and-conquer is one way to address parallel computing. You are right that it is a shame this paper is very one-sided. However, I think the core focus on parallelism within the Java community is also very one-sided, towards shared-memory designs and FJ. From personal experience on high-volume systems I've seen a lot of success with pipeline and actor models, plus shared-nothing is significantly easier to reason about and tune. I'd like to see a lot more open thinking in this area from the core Java team.

A very valid alternative is to get better at writing single-threaded code. It is amazing what can be achieved on a single thread when code is not grossly inefficient. Also, without going parallel and/or concurrent, the code is so much easier to reason about. But this has a big drawback: no company can market writing better code as a product they can sell on any scale.

Every design has benefits and consequences. Let's discuss both sides freely so all can learn and make informed decisions.

> I've been uncomfortable with FJ for some time. I see people struggling
> to design for it, debug it, and more often fail to get performance
> benefits. Other approaches such as pipelining tasks can often be way
> more effective and easier to reason about.

I am comfortable with FJP, and I am happy seeing its use in JDK 8
Streams, because, frankly, you don't frequently see the execution
frameworks with that kind of performance magic: striped submission
queues, in-submitter execution while pool threads ramp up, false sharing
avoidance in thread queues, randomized balancing with super-fast PRNGs,
lock-free/relaxed-ops work queues, avoiding multiword-cas/locked
implementations of control words, branch prediction considerations, etc.

FJP tackles the problem of exploiting the internal parallelism without
sacrificing the external one. How successful is pipelining at those
things? I mean, surely, you can do something like Disruptor with
busy-wait handoffs, but in my mind, it is even more "non-sympathetic" to
other code than running a few additional pools full of threads.

I would not suggest the Disruptor for general purpose use. It is far too specialised. I have said this openly many times. Busy spin strategies are best suited to environments where the number of available cores is greater than the number of threads wanting to run.
 
>
> The following blog shows just how bloggers can so easily misuse parallel
> collections by having no sympathy for CPU resource on a system. I think
> this is only the tip of the iceberg.

>
>   http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/
> <http://www.takipiblog.com/2014/04/03/new-parallelism-apis-in-java-8-behind-the-glitz-and-glamour/>

Breaking news: not a silver bullet again! You can't actually run faster
with #Threads > #CPUs! That parallel() thing is a lie!

If you look at the benchmarks there... well... um... I would just say it
is a good exercise for seasoned benchmark guys to spot the mistakes
which make the results questionable. Anyway, if we want to *speculate*
the experimental setup is miraculously giving us the sane performance data:

 * Sorting is now only 20% faster – a 23X decline.
 * Filtering is now only 20% faster – a 25X decline.
 * Grouping is now 15% slower.

(That 23X-25X decline is a red herring because it compares the results
of two different tests).

Am I reading it right? You put 10 client threads submitting the same
task in the pool, and you are *still* 20% faster on parallel tests? And
that is on an 8-hardware-thread machine (which is a funny pitfall on its
own)? That means, even when external parallelism is present, you can
still enjoy the benefits of the internal one? Or that is a fever dream
of an overloaded machine?

I hope you do not think I'm advocating this blog!!! Quite the opposite. I think it is an example of how easy it is to get a mental model very wrong with parallel streams, never mind how broken the benchmark is.

Loads of people are going to want to use this cool new feature. Well, at least those of us who earn a living from consulting can look forward to increased sources of revenue as people dig some very deep holes.

Martin...

Kirk Pepperdine

Apr 13, 2014, 12:39:38 PM
to mechanica...@googlegroups.com
I don’t see FJP as being mutually exclusive with pipelining. With the ability to selectively fork a stream they should be able to work hand in hand.

As for single threaded performance, 3 weeks ago we took an app performing work on FPML documents up to 5.5 million TPS on a laptop by simply focusing on single thread performance. Unfortunately, using lambdas (in their current form) we would never have been able to reach this number, as my own benchmarking shows a 20x performance hit when moving from classical imperative code to using lambdas. I only wish I could share the code but...

Regards,
Kirk

Richard Warburton

Apr 13, 2014, 12:56:00 PM
to mechanica...@googlegroups.com
Hi,

One of the big things is that stream() methods like sum() don't work on BigDecimal or BigInteger.  Why is using BigDecimal in Java so painful. :(

Well they've focused on the primitive specialised stream variants, which really needed core library support. You can implement sum as reduce(BigDecimal.ZERO, BigDecimal::add) so it's not that painful.
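A minimal sketch of that workaround (the prices here are made-up values):

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

public class BigDecimalSum {
    public static void main(String[] args) {
        List<BigDecimal> prices = Arrays.asList(
                new BigDecimal("19.99"),
                new BigDecimal("0.01"),
                new BigDecimal("5.00"));

        // There is no sum() on Stream<BigDecimal>, but reduce with the
        // identity element and an add method reference does the same job.
        BigDecimal total = prices.stream()
                                 .reduce(BigDecimal.ZERO, BigDecimal::add);

        System.out.println(total); // 25.00
    }
}
```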

regards,

  Richard Warburton

Richard Warburton

Apr 13, 2014, 12:58:39 PM
to mechanica...@googlegroups.com
Hi,

I would have to say that every highly scalable system I’ve been involved with has employed pipelining. FJ is interesting but IMHO its use cases are limited and questionable. Unfortunately I fear that FJ nepotism has influenced Java’s implementation of Lambdas. I smell a “Spring” like opportunity here.

Not so much the implementation of Lambdas; more the implementation of Streams. I do agree with you though Kirk: it would be easy for a 3rd party library to provide an alternative.

Aleksey Shipilev

Apr 13, 2014, 1:06:56 PM
to mechanica...@googlegroups.com
On 04/13/2014 08:02 PM, Martin Thompson wrote:
> I'm trying to stimulate a healthy debate and increase the understanding
> in our community. My primary goal is to see software developed that
> delivers real value for a business.

Ok, the real value for business is code brevity, which means more
readability, more expressiveness, less bugs, less maintenance burden.
You seem to be leaning towards peak performance, and that thing is at
odds with usability. For 99.9999% of businesses peak application
performance is the second order concern. If there is an easy performance
boost with minimal effort, business will go there as well.

> Divide-and-conquer is one way to address parallel computing. You are
> right in that it is a shame this paper is very one sided. However I
> think the core focus on parallelism within the Java community is very
> one sided towards shared memory designs and FJ.

Ummm. How would you do otherwise with the language which embraces shared
memory? Anyway, that statement is invalidated by Akka (which is an
obvious departure from shared memory model) that is driven by FJP. Why?
Because metal is shared memory, and to have close to bare metal
performance, you have to face shared memory at some level.

> From personal experience
> on high volume systems I've seen a lot of success with pipeline and
> actor models, plus shared nothing is significantly easier to reason
> about and tune. I'd like to see a lot more open thinking in this area
> from the core Java team.

Good. The existence of these libraries, and their high performance
tailored with maintainability is the direct consequence of many core
developers from both those libraries and Java core facing the music
*for* users. This is why JDK 8 Streams are exposable to users: users
should not be burdened with low-level stuff for their code to "just work".

> A very valid alternative is to get better at writing single threaded
> code. It is amazing what can be achieved on a single thread when code is
> not grossly inefficient. Also without going parallel, and/or concurrent,
> the code is so much easier to reason about. But this has a big drawback,
> no company can market writing better code as a product they can sell on
> any scale.

Sure, you can write single-threaded code, except for the cases you
can't. That seems a very generic and self-obvious claim, so I can't
follow that point any further.

> I hope you do not think I'm advocating this blog!!! Quite the opposite.
> I think it is an example of how easy it is to get a mental model very
> wrong with parallel streams, never mind how broken the benchmark is.

Every technology and every improvement complicates the mental model
(even those trying to simplify the model, surprisingly, because they do
it by shaving off the unnecessary details, which are, at times, very
necessary -- and there's no way out of this, because the Universe is
complicated and you can only re-balance complexity, not hide it).
Parallel streams are no exception to this rule.

-Aleksey.

Richard Warburton

Apr 13, 2014, 1:07:18 PM
to mechanica...@googlegroups.com
Hi,

The streaming API is aimed at general use from what I've heard on the conference circuit of late. That means it is sharing a machine with lots of other threads involved in "general use" within applications, e.g. a web container with many threads in its pool. If the default is to assume exclusive access to the system resources then I'd say that is somewhat naive. The same can be said for any component/framework that starts its own "inconsiderate" thread pool.

I was under the impression that things work a little differently from that, actually. My understanding was that when your JVM is running in a Java EE container, things run sequentially even if you call .parallelStream(), in order to stop different threads from being blocked from making progress while waiting on results from an overly contended FJP.

Peter Lawrey

Apr 13, 2014, 1:40:41 PM
to mechanica...@googlegroups.com

Thank you for the tip.


Martin Thompson

Apr 13, 2014, 1:43:21 PM
to mechanica...@googlegroups.com
On 13 April 2014 18:06, Aleksey Shipilev <aleksey....@gmail.com> wrote:
On 04/13/2014 08:02 PM, Martin Thompson wrote:
> I'm trying to stimulate a healthy debate and increase the understanding
> in our community. My primary goal is to see software developed that
> delivers real value for a business.

Ok, the real value for business is code brevity, which means more
readability, more expressiveness, less bugs, less maintenance burden.
You seem to be leaning towards peak performance, and that thing is at
odds with usability. For 99.9999% of businesses peak application
performance is the second order concern. If there is an easy performance
boost with minimal effort, business will go there as well.

How did I give the impression I'm leaning towards peak performance? I'm only exploring the subject of parallel streams and FJ to see if they meet their goals.

Performance is a misdirection in this context. Going parallel in this context is about increasing utilisation of our modern multicore hardware.

Code brevity itself does not necessarily lead to increased business value. Just look at Perl against the criteria you list. Here you are saying business value comes from parallel streams making things easier, then later you say every technology "complicates the mental model". This feels like a contradiction.

For code to be maintainable it must be clear and easy to reason about.  I think many would argue that larger scale apps built with FJ or Map-Reduce are not easy to maintain or debug.

I hope the point of this discussion is for people to understand the benefits and be able to avoid the pitfalls as they increase their understanding.

> Divide-and-conquer is one way to address parallel computing. You are
> right in that it is a shame this paper is very one sided. However I
> think the core focus on parallelism within the Java community is very
> one sided towards shared memory designs and FJ.

Ummm. How would you do otherwise with the language which embraces shared
memory? Anyway, that statement is invalidated by Akka (which is an
obvious departure from shared memory model) that is driven by FJP. Why?
Because metal is shared memory, and to have close to bare metal
performance, you have to face shared memory at some level.

The statement is not invalidated by Akka. Akka is from the Scala community and not to be found in the JDK or JEE. Also FJP is only one of many possible ways of scheduling actors.

Real performance and utilisation comes from cores working on memory in their core local caches free from the contention of other cores. Shared memory is an illusion afforded to each core via a messaging protocol to achieve cache coherence. From a bare metal perspective it is very easy to make an argument that shared memory is an illusion and a very leaky abstraction. One has to only consider the NUMA effects on modern servers with many sockets.

When I go for bare metal performance I only used shared memory as a means of message passing as this maps very cleanly to the cache coherence model I'm actually sitting on as a non-leaky abstraction.
 
Martin...

Aleksey Shipilev

Apr 13, 2014, 2:12:38 PM
to mechanica...@googlegroups.com
On 04/13/2014 09:43 PM, Martin Thompson wrote:
> How did I give the impression I'm leaning towards peak performance? I'm
> only exploring the subject of parallel streams and FJ to see if they meet
> their goals.

Parallel streams obviously meet their goals of providing the accessible
parallelism to users. FJP obviously meets its goals of providing the
foundation for that parallel work (validated by JDK 8 itself, Akka,
GPars, etc).

> Performance is a misdirection in this context. Going parallel in this
> context is about increasing utilisation of our modern multicore hardware.

Wait, what? Which context? I don't care about the utilization, I don't
think anyone cares about increasing the utilization unless you run the
power company billing the datacenter. I do care about performance though.

> Code brevity itself does not necessarily lead to increased business
> value. Just look at Perl against the criteria you list.

Your counter-argument is incorrect in assuming business value is linear
to code brevity. It is not: there is a sweet-spot of brevity which pays
off maximally well. You would argue this:

public void printGroups(List<Person> people) {
    Set<Group> groups = new HashSet<>();
    for (Person p : people) {
        if (p.getAge() >= 65)
            groups.add(p.getGroup());
    }
    List<Group> sorted = new ArrayList<>(groups);
    Collections.sort(sorted, new Comparator<Group>() {
        public int compare(Group a, Group b) {
            return Integer.compare(a.getSize(), b.getSize());
        }
    });
    for (Group g : sorted)
        System.out.println(g.getName());
}

...is more readable than this?

public void printGroups(List<Person> people) {
    people.stream()
          .filter(p -> p.getAge() >= 65)
          .map(p -> p.getGroup())
          .distinct()
          .sorted(comparing(g -> g.getSize()))
          .forEach(g -> System.out.println(g.getName()));
}


> Here you are saying business value is coming from parallel streams
> making things easier then later you say every technology "complicates
> the mental model". This feels like a contradiction.

If you re-read that thought carefully: every technology *does*
complicate the mental model, by sweeping unnecessary things under the
rug, but adding to the under the rug mess. The "common" usages, however,
are simplified at the expense of increased complexity elsewhere.

This is what I see in this thread: it is harder to bend parallel streams
to do *exactly* what you want low-level-wise, but that's only the price
for exposing the entire realm of Java developers to readable and
maintainable parallel code.

> For code to be maintainable it must be clear and easy to reason about.
> I think many would argue that larger scale apps built with FJ or
> Map-Reduce are not easy to maintain or debug.

And I would argue programming is hard. Not easy to maintain or debug
compared to what? Is there an option which makes solving the problems
FJ/MR systems are facing easier *without* sacrificing the benefits of
FJ/MR? (Hint-hint: you are not in the single-threaded Texas anymore).

> The statement is not invalidated by Akka. Akka is from the Scala
> community and not to be found in the JDK or JEE. Also FJP is only one of
> many possible ways of scheduling actors.

...and yet, FJP is their default high-performance executor.

> When I go for bare metal performance I only used shared memory as a
> means of message passing as this maps very cleanly to the cache
> coherence model I'm actually sitting on as a non-leaky abstraction.

That accurately describes the we-care-about-performance approach for
modern Java today: using, providing, and improving light-weight
inter-thread communication primitives [see e.g. entire j.u.c.*, other
lock-free stuff, fences, enhanced volatiles, etc]. Does that mean Java
community and core Java team is "open thinking in this area", contrary
to what you seem to be implying?

-Aleksey

Edward Harned

Apr 13, 2014, 2:28:12 PM
to mechanica...@googlegroups.com
I wrote the article and I stand by it today.

For the life of me, I’ve never been able to figure out why Oracle pushes this F/J framework so hard. F/J is just a niche in parallel computing. Recursive decomposition is just a tiny niche in F/J. Yet they push this F/J framework as if it were the Queen of the Ball.

The JVM will probably never have a parallel engine like the one found in Microsoft’s .Net Framework for Parallel Programming. The Sumatra project may be ready for Java9 but integrating parallel streams to use GPU’s is not going to be easy, probably won’t be ready for Java9, and certainly won’t be usable by many laptop/desktop users. So look at the next best thing.

Instead of a fork-and-join F/J framework, use a scatter-gather F/J framework. No submission queues. No “continuation threads/continuation fetching.” No horrific complexity. In application development frameworks, flexibility, simplicity, error recovery, accountability, and documentation are the primary properties. Speed comes later.

If I can build a scatter-gather engine in a few months then so can they. In 2010 I built a proof-of-concept that work-sharing is simpler and works just as well as work-stealing without submission queues and an intermediate join(). The good professor ignored it and suggested I read the research papers on work-stealing. I took the parallel engine out of another F/J project I maintain and dropped in a scatter-gather engine. Today it performs superbly as the TymeacDSE project on SourceForge.
http://sourceforge.net/projects/tymeacdse/

This is an application server/service and not suitable for inclusion in the JDK, which is why I never proposed it.

Dynamic decomposition works where recursive decomposition fails. Dynamic decomposition greatly simplifies stream parallelization. As you can see, it’s not that hard and it doesn’t take that long. If you build it so the caller, server, and application code are all separate then you can easily make the server local or remote.

For Java EE just have a Remote Procedure Call version. Set up a separate JVM just for RPC. With system properties in EE (and even SE) you can offload the multi-threading to a separate machine. The ability to do that would solve many problems in both places.

In a nutshell, start from the beginning. Build a parallel engine that can sustain Java into the future. A tiny niche product with severe restrictions trying to emulate a better methodology is not the way to go.

ed




On Sunday, April 13, 2014 3:38:49 AM UTC-4, Martin Thompson wrote:
The following article proposes that the FJ framework and parallel collections could be a calamity for Java. 


I've been uncomfortable with FJ for some time. I see people struggling to design for it, debug it, and more often fail to get performance benefits. Other approaches such as pipelining tasks can often be way more effective and easier to reason about.

I also find it amusing that after years of trying to hide databases behind ORMs that to use the parallel collections effectively you need to understand set theory for writing good queries.

The following blog shows just how bloggers can so easily misuse parallel collections by having no sympathy for CPU resource on a system. I think this is only the tip of the iceberg.


I'm curious to know if others have doubts about either Fork-Join or parallel collections, or if these are really good ideas and somehow the penny has not dropped for me? I'd really like to see a good evidence based debate on this subject.

Regards,
Martin...


Martin Thompson

Apr 13, 2014, 2:44:09 PM
to mechanica...@googlegroups.com
On 13 April 2014 19:12, Aleksey Shipilev <aleksey....@gmail.com> wrote:
On 04/13/2014 09:43 PM, Martin Thompson wrote:
> How did I give the impression I'm leaning towards peak performance? I'm
> only exploring the subject of parallel streams and FJ to see if they meet
> their goals.

Parallel streams obviously meet their goals of providing the accessible
parallelism to users. FJP obviously meets its goals of providing the
foundation for that parallel work (validated by JDK 8 itself, Akka,
GPars, etc)

"Obviously", where is the evidence? You may be right but you cannot make that statement yet.


> Performance is a misdirection in this context. Going parallel in this
> context is about increasing utilisation of our modern multicore hardware.

Wait, what? Which context? I don't care about the utilization, I don't
think anyone cares about increasing the utilization unless you run the
power company billing the datacenter. I do care about performance though.

Without efficient utilisation you do not get performance. You need to efficiently utilise the other cores to get the parallel speedup, provided that speedup attempt does not suffer a costly coherency penalty as described by the Universal Scalability Law.
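For reference, Gunther's Universal Scalability Law models relative capacity as C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)), where alpha captures contention and beta the coherency penalty mentioned here. A sketch with purely invented coefficients:

```java
// Universal Scalability Law: relative capacity at N processors.
// alpha models serialisation/contention, beta the coherency (crosstalk)
// penalty. The coefficient values below are invented for illustration.
public class Usl {
    static double capacity(int n, double alpha, double beta) {
        return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1));
    }

    public static void main(String[] args) {
        double alpha = 0.05, beta = 0.002;
        for (int n : new int[]{1, 4, 16, 64}) {
            System.out.printf("N=%-3d speedup=%.2f%n",
                              n, capacity(n, alpha, beta));
        }
        // With non-zero beta the curve peaks and then declines:
        // adding cores past the peak reduces throughput.
    }
}
```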

Streams can absolutely improve code clarity for those who embrace set theory. This is a very separate argument from the use of parallel streams and FJ. When *parallel* it is much harder to reason about and debug.

As I've said repeatedly, and you don't acknowledge, being able to reason about and debug code is a huge part of the maintenance cost. Syntax is one part, but only one part, of that.
 
> Here you are saying business value is coming from parallel streams
> making things easier then later you say every technology "complicates
> the mental model". This feels like a contradiction.

If you re-read that thought carefully: every technology *does*
complicate the mental model, by sweeping unnecessary things under the
rug, but adding to the under the rug mess. The "common" usages, however,
are simplified at the expense of increased complexity elsewhere.

This is what I see in this thread: it is harder to bend parallel streams
to do *exactly* what you want low-level-wise, but that's only the price
for exposing the entire realm of Java developers to readable and
maintainable parallel code.

Going parallel on the same data can have benefits and does have consequences. There are alternatives that are sometimes a better fit. It is healthy to have diversity.
 
> For code to be maintainable it must be clear and easy to reason about.
>  I think many would argue that larger scale apps built with FJ or
> Map-Reduce are not easy to maintain or debug.

And I would argue programming is hard. Not easy to maintain or debug
compared to what? Is there an option which makes solving the problems
FJ/MR systems are facing easier *without* sacrificing the benefits of
FJ/MR? (Hint-hint: you are not in the single-threaded Texas anymore).

Absolutely programming is hard. It gets a lot harder when we go parallel or concurrent. That is why single threaded actors, or pipeline stages, are much easier to program. By using actors and pipelines we have an option for staying in single threaded Texas (Kansas).
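The single-threaded-stage style described above can be sketched with plain blocking queues. This toy example (not from the thread) keeps each stage's state confined to one thread, so the stages communicate only by message passing:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two single-threaded stages joined by a queue: stage one produces,
// stage two consumes. Each stage owns its state exclusively, so
// neither needs locks or atomics of its own.
public class Pipeline {
    private static final int POISON = -1; // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) q.put(i * i);
                q.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                long sum = 0;
                for (int v; (v = q.take()) != POISON; ) sum += v;
                System.out.println("sum of squares: " + sum); // 55
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```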

> The statement is not invalidated by Akka. Akka is from the Scala
> community and not to be found in the JDK or JEE. Also FJP is only one of
> many possible ways of scheduling actors.

...and yet, FJP is their default high-performance executor.

Without core local memory, thread affinity, or other primitives that is the best they can hope for.
 
> When I go for bare metal performance I only used shared memory as a
> means of message passing as this maps very cleanly to the cache
> coherence model I'm actually sitting on as a non-leaky abstraction.

That accurately describes the we-care-about-performance approach for
modern Java today: using, providing, and improving light-weight
inter-thread communication primitives [see e.g. entire j.u.c.*, other
lock-free stuff, fences, enhanced volatiles, etc]. Does that mean Java
community and core Java team is "open thinking in this area", contrary
to what 

What you list above is all about concurrent programming and dealing with the issues of sharing memory. To reduce contention on shared memory, things like stack allocation, core-local memory, and memory-layout control could be argued onto the list.

Martin Thompson

Apr 13, 2014, 2:57:36 PM
to mechanica...@googlegroups.com
How do you ensure all the thread pools in the various app servers, 3rd party libs, and the parallel streams all play well together? I think it is a tricky problem because once we have a centralised pool it becomes a contention point, plus issues like unhandled exceptions and rejected executions become tricky. We need to work out a clean way of surfacing the issues so developers don't fall into a trap that makes their life worse rather than better.
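One partial workaround circulating at the time: submit the stream's terminal operation from inside a dedicated ForkJoinPool, so the work runs there rather than on the JVM-wide common pool. This relies on an undocumented implementation detail of the JDK 8 stream machinery, so treat it as a fragile trick rather than a guaranteed API:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class IsolatedParallelStream {
    public static void main(String[] args) throws Exception {
        // Tasks submitted from inside a ForkJoinPool worker run in that
        // pool, so the parallel stream below uses our 4 threads instead
        // of the shared common pool.
        ForkJoinPool pool = new ForkJoinPool(4);
        long sum = pool.submit(() ->
                IntStream.rangeClosed(1, 1_000).parallel()
                         .asLongStream().sum()
        ).get();
        pool.shutdown();
        System.out.println(sum); // 500500
    }
}
```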

I've seen a lot of talks on how we can use streaming APIs, which are great, and it can be really nice. Your own talk is a great example of it done well :-) I just think it is fair to explore the implications and consequences, which is more boring but just as important.

Aleksey Shipilev

Apr 13, 2014, 3:30:47 PM
to mechanica...@googlegroups.com
On 04/13/2014 10:44 PM, Martin Thompson wrote:
> On 13 April 2014 19:12, Aleksey Shipilev <aleksey....@gmail.com
> "Obviously", where is the evidence? You may be right but you cannot make
> that statement yet.

The blog links you were posting are the evidence for that: users get
parallel speedups with parallelStream(). Since that code uses FJP to
achieve those speedups, it validates the use of FJP.

But you want something else? You want it to deliver speedups in all the
cases? (To quote yourself, being unable to "so easily misuse parallel
collections by having no sympathy for CPU resource on a system").

Now if you think there are better options, the burden of proof is on
you. Can you beat the FJP-backed parallelStream() performance with
non-FJP-backed actors and/or pipelines in similar scenarios?

> Without efficient utilisation you do not get performance. You need to
> efficiently utilise the other cores to get the parallel speedup.

Um, no? Utilization is tangential to performance. I don't have to
"efficiently" utilize the cores to get the speedup (note you mix
"speedup" and "parallel speedup" freely, but these are not the same), I
just have to use the cores... sometimes. For example, the non-obvious
thing for FJP and Streams is that there are clear cases where it is
better *not* to use the core and stay local for short tasks (this is
where execute-in-submitter thing was born from -- contrary to the belief
that those bookworm academicians are here to kill us all).

> Streams can absolutely improve code clarity for those who embrace set
> theory.

Oh. I guess programming is even harder for alphabet deniers. Seriously,
Martin! I stopped reading after this line.

-Aleksey.

Martin Thompson

Apr 13, 2014, 4:46:49 PM
to mechanica...@googlegroups.com
On 13 April 2014 20:30, Aleksey Shipilev <aleksey....@gmail.com> wrote:
On 04/13/2014 10:44 PM, Martin Thompson wrote:
> On 13 April 2014 19:12, Aleksey Shipilev <aleksey....@gmail.com
> "Obviously", where is the evidence? You may be right but you cannot make
> that statement yet.

The blog links you were posting are the evidence for that: users get
parallel speedups with parallelStream(). Since that code uses FJP to
achieve those speedups, it validates the use of FJP.

But you want something else? You want it to deliver speedups in all the
cases? (To quote yourself, being unable to "so easily misuse parallel
collections by having no sympathy for CPU resource on a system").

We have been talking about how people will be able to cope with this. That is development effort and maintenance. It is all about how one can reason about code.

The evidence I'm looking for is whether people can easily reason about this code compared to the alternatives, where sometimes the alternative is just single threaded code. Let us be clear, this is not about streams and their API; it is about the potential scale-up of parallel streams and FJ, and the implications of adopting such techniques in the full context of software delivery.
 
Now if you think there are better options, the burden of proof is on
you. Can you beat the FJP-backed parallelStream() performance with
non-FJP-backed actors and/or pipelines in similar scenarios?

Erlang
 
> Without efficient utilisation you do not get performance. You need to
> efficiently utilise the other cores to get the parallel speedup.

Um, no? Utilization is tangential to performance. I don't have to
"efficiently" utilize the cores to get the speedup (note you mix
"speedup" and "parallel speedup" freely, but these are not the same), I
just have to use the cores... sometimes. For example, the non-obvious
thing for FJP and Streams is that there are clear cases where it is
better *not* to use the core and stay local for short tasks (this is
where execute-in-submitter thing was born from -- contrary to the belief
that those bookworm academicians are here to kill us all).

If you think utilisation is not directly related to performance then Amdahl, Erlang, Little, Gunther and many others must all be wrong and you are right.
 
> Streams can absolutely improve code clarity for those who embrace set
> theory.

Oh. I guess programming is even harder for alphabet deniers. Seriously,
Martin! I stopped reading after this line.

It seems we will have to differ. I think there is value in FJ and parallel streams, there are also implications and consequences. You seem to think it is the only thing and that is fine. 

Michael Barker

Apr 13, 2014, 5:15:02 PM
to mechanica...@googlegroups.com
Is anyone on the list using FJ or parallel streams in production to get real speed-ups on solving parallel problems (and not just as a better executor, a-la Akka)?  Be interested in hearing real-world experiences.

Personally we've just removed the only piece of software in our production environment that used FJ and replaced it with a message passing/request based concurrency solution, but the problem in question didn't really fit the parallel model particularly well. So that is not really a comment on FJ, more about using the wrong tool for the job. I'd like to hear about the cases where it is the right tool.

Mike.



Aleksey Shipilev

Apr 13, 2014, 5:54:52 PM
to mechanica...@googlegroups.com
On 04/14/2014 12:46 AM, Martin Thompson wrote:
>
> On 13 April 2014 20:30, Aleksey Shipilev <aleksey....@gmail.com
> <mailto:aleksey....@gmail.com>> wrote:
>
> On 04/13/2014 10:44 PM, Martin Thompson wrote:
> > On 13 April 2014 19:12, Aleksey Shipilev
> <aleksey....@gmail.com <mailto:aleksey....@gmail.com>
> > "Obviously", where is the evidence? You may be right but you
> cannot make
> > that statement yet.
>
> The blog links you were posting are the evidence for that: users get
> parallel speedups with parallelStream(). Since that code uses FJP to
> achieve those speedups, it validates the use of FJP.
>
> But you want something else? You want it to deliver speedups in all the
> cases? (To quote yourself, being unable to "so easily misuse parallel
> collections by having no sympathy for CPU resource on a system").
>
>
> We have been talking about how people will be able to cope with this.
> That is development effort and maintenance. It is all about how one can
> reason about code.

Ah, you conflate the reasoning about the code correctness (I submit this
is what most people mean under "reasoning about the code"), and
understanding the performance model of the code (this *may* be called
reasoning, but this will not resonate with most people, including me --
since most people don't care about performance provided it is not
horrendously bad).

I submit that reasoning about the code is greatly simplified with
Streams. Performance model? Not so much. As I keep saying, the
performance model for *any* parallel application is rather complex, and
gets even more complex as you try to abstract things. I further submit
that no abstraction is immune from that, either parallel streams,
actors, pipelines, or whatever other thing there is -- it is a tradeoff
for making common use case more appealing.

> The evidence I'm looking for is can people easily reason about this code
> compared to alternatives. Where sometimes the alternative is just single
> threaded code. Let us be clear this is not about streams and their API,
> it is about the potential scale up of parallel streams and FJ, with the
> context of the implications of adopting such techniques in the full
> context of software delivery.

This problem (e.g. predicting performance and guiding optimization) is
widely unsolved even for sequential code (algorithmic complexities vs.
real life performance, anyone?). It is not magically solved for parallel
code either, but the approach is the same: you start off getting
empirical data to construct the models which are fitting that data, and
from there decide if the predicted behavior from the model is what you need.


> Now if you think there are better options, the burden of proof is on
> you. Can you beat the FJP-backed parallelStream() performance with
> non-FJP-backed actors and/or pipelines in similar scenarios?
>
>
> Erlang

That's not the answer, Martin. Verifiable experiments, please. You know,
those things that produce "evidence" you want in this thread. P.S.
Erlang/OTP is a funny non-argument in this discussion for two reasons:
a) it uses work-stealing executors as well, hm; b) somehow Akka is not
an example, but Erlang is, hm.

> > Streams can absolutely improve code clarity for those who embrace set
> > theory.
>
> Oh. I guess programming is even harder for alphabet deniers. Seriously,
> Martin! I stopped reading after this line.
>
>
> It seems we will have to differ. I think there is value in FJ and
> parallel streams, there are also implications and consequences. You seem
> to think it is the only thing and that is fine.

I fail to see how my "of course it has shortcomings, because that's how
the world works: there are always tradeoffs", "every technology and
every improvement complicates the mental model", and "it is harder to
bend parallel streams to do *exactly* what you want low-level-wise, but
that's only the price for exposing the entire realm of Java developers
to readable and maintainable parallel code" -- can be treated as me
thinking "it is the only thing".

-Aleksey.

Jin Mingjian

unread,
Apr 13, 2014, 11:42:53 PM4/13/14
to mechanica...@googlegroups.com
I am coming late :) I think the discussion is turning into a battle of rhetoric, although that may be part of this kind of discussion.

I read the first version of that FJ calamity article (OK, Edward, come in) in the past year. My own understanding is that some of its wording is rather subjective, such as "academic exercise"; other parts do hit some real points.

Again, claims such as "For 99.9999% of businesses peak application performance is the second order concern" should be avoided because of their strongly subjective color, unless you have at least a public poll to support them. As Joshua Bloch "suggested", performance is always important for library authors. There are real trade-offs, but when your business needs performance, it matters. Viewed over a whole timeline rather than at a static point, peak application performance would be a first-order concern for many businesses at some point, IMHO.

One word from Martin I really like is "measuring". Microbenchmarking has its problems indeed, but if we have no measured results at all, it makes little sense to draw conclusions from the discussion.

I did some experimental investigation into FJ a while ago. Here are my 5c:

1. The detailed implementation of FJ is very, very complex.
I personally feel only Doug Lea has the ability to maintain it, and this is an engineering flaw of FJ. If you hit a bug in FJ in your critical system, it may be hard to work around or fix quickly, and yet it sits right in your critical core...

2. I benchmarked external task submission to FJ in async mode.
FJ was around 5x slower (10x at worst, though not always) than my lock-free-MPMC-backed thread pool in a simple benchmark that measures how long a batch of simple tasks takes from submission to completion, i.e. the overhead of the pool framework itself. The design is just a prototype, without any optimization, and far from my ideal; but the result makes me more open to alternatives outside FJ.
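For reference, a minimal sketch of the kind of harness described here: timing a batch of near-empty tasks from submission to completion against any ExecutorService. All class names, pool sizes, and task counts below are arbitrary illustrations, not taken from the experiment above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchOverheadBench {
    // Times how long a batch of trivial tasks takes from submit to finish.
    // Since the tasks do no work, this mostly measures framework overhead.
    public static long measureNanos(ExecutorService pool, int tasks) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(tasks);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::countDown); // near-empty task
        }
        done.await();                      // wait for the whole batch
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println("batch took " + measureNanos(pool, 100_000) + " ns");
        pool.shutdown();
    }
}
```

Since ForkJoinPool also implements ExecutorService, the same harness can be pointed at FJ and at a fixed thread pool for a rough comparison; a defensible comparison would still need JMH-style warmup, multiple forks, and peer review, as Aleksey asks for below.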

3. Akka's use of FJ does not show that FJ is the best common pool design.
There are no strong competitors in concurrency library design. Newly arriving library authors just say "you see, FJ-based Akka is trendy, we can use it too", but we have not seen any measurements saying why.

4. Finally, #3 uncovers a key point: the diversity of current Java concurrency libraries is not enough, as Martin hinted. I'd like to see more work in this field in the future.


best regards,
Jin...






Aleksey Shipilev

unread,
Apr 14, 2014, 3:24:20 AM4/14/14
to mechanica...@googlegroups.com
On 04/14/2014 07:42 AM, Jin Mingjian wrote:
> One word from Martin I really like is "measuring". Microbench has its
> problem indeed. But if we have not any measuring results, it is not much
> sense to draw any conclusions from the discussion.

If your benchmarks are flawed, you can not draw *any* conclusion from
them, because you are looking at garbage data. I understand people may
see things even in the white noise, but this is not "measuring", this is
"guessing".

> I've done some experimental investigation to FJ side works ago. Here are
> my 5c:
>
> 1. The detailed implementation of FJ is very, very complex.
> I personally feel only Doug Lea has the ability to maintain it. But this
> is just an engineering flaw of FJ. If you hit a bug in FJ in your
> critical system, it may be hard to work around or fix quickly, and yet
> it sits in your critical core...

That's true for any other system which abstracts things. Write the code
straight in machine assembly. Wait, there are also bugs in CPUs
themselves...

Of course, FJP code is complex, but that is only because it struggles to
attain consistent performance on the platforms where it is hard to do
otherwise. (Unfortunately, this includes Java and the variety of JVMs --
that means most of the code there is heavily optimized to break the
reliance on a smart compiler coming and fixing everything). This is hardly
an "engineering flaw"; it is engineering excellence to hide all
these complexities in the library -- that's what libraries are for.

But you can only appreciate this if/when you are dealing with these
issues yourself in, say, writing your own thread pool. This is a good
exercise, and if diversity matters for you, get on writing the FJP
killer. If that FJP killer is close in performance for the similar
scenarios, but is much less complex, it would be silly not to adopt it.

> 2. I benched the external task submissions of FJ in the favor of async
> mode.
> The FJ's performance is slower around 5x(the largest is 10x, but does
> not always happen) than that of my lockfree MPMC powered thread pool in
> a simple benchmark, which gives how long a batch of simple tasks takes
> from submitting to finishing to measure the overhead of pool framework
> itself. The design is just prototyping kind, without any optimization,
> and far from my ideal thoughts. But the result makes me more open to the
> outside of FJ.

Two things:
a) Where are your benchmarks? Are your results peer-reviewed? ;)
b) Comparing prototype code, which likely has bugs due to a lack of wide
testing, against production-grade code that is used everywhere and has
dealt with lots of non-obvious real-world quirks, may not tell you
anything useful at all.

> 3. AKKA's FJ usage can not show that the FJ is best common pool design.

This is from the bookworm me: in the open world, you can not show
*anything* is the best unless you try all other infinite possibilities.
Hence, this is one of the questions that is unanswerable by
construction.

> Because there are not strong competitors in the
> concurrency libraries design. The new-coming library authors just
> say "you see FJ based AKKA is trendy, we can use it". But we have not
> seen any measuring to say why.

I dunno, this?
http://letitcrash.com/post/17607272336/scalability-of-fork-join-pool

Or, this?
http://shipilev.net/talks/jeeconf-May2013-forkjoin.pdf (in Russian, but
graphs are pretty much self-explanatory).

> 4. Finally, #3 uncovers a key: the diversity of current Java concurrency
> libraries now is not enough, as Martin hinted. I'd like to see more
> works in this field in the future.

The mere wish is not enough, and the "lack of competition" may as well
indicate the competition can not even compete. But you can prove me
wrong by providing the implementation which is making better tradeoffs
than FJP does; since I know how much engineering is put into FJP, I
doubt you can provide one in the next 3 years, even if you had the
experience of Doug Lea himself.

-Aleksey.

Sergey Zubarev

unread,
Apr 14, 2014, 4:50:37 AM4/14/14
to mechanica...@googlegroups.com
Hi, Aleksey!

I read your slides regarding FJ, quite interesting.
What tool did you use to get pictures like those on pages 25 and 35?

And one question about the wakeup time for one thread on page 43 (which is ~50 usec):
does it really mean that, for example, a Semaphore.release() call takes 50 usec?
Does it really mean that for example Semaphore.release() call takes 50 usec?

-
Sergey Zubarev

Aleksey Shipilev

unread,
Apr 14, 2014, 4:59:06 AM4/14/14
to mechanica...@googlegroups.com
On 04/14/2014 12:50 PM, Sergey Zubarev wrote:
> Read your slides regarding FJ, quite interesring.
> What the tool you used to get pictures like on page 25 and 35?

There is a link at the latest slide.

> And one question about wakekup time for one thread on page 43 (which
> is ~50 usec)? Does it really mean that for example
> Semaphore.release() call takes 50 usec?

More like "unparking" from Semaphore.acquire() takes that much.

But you can think about it as the latency which is here when you try to
unpark the other thread. That is OS wakeup lag, and you can only do so
much at the user level, except for waking up more threads preemptively
(this is clearly seen by four pool threads at #45, but it only helps
that much) while starting executing the task in submitter (which is
shown at #46, where only 1/3 of the time the pool was woken up
completely, but submitters are working in that shadow nevertheless).
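A crude way to observe that wakeup lag yourself is to time a single semaphore handoff between two threads. One handoff is only indicative (and the class name is made up), but it shows where the microseconds go.

```java
import java.util.concurrent.Semaphore;

public class WakeupLag {
    // Measures nanoseconds from release() in one thread until the
    // parked acquirer actually runs again: roughly the OS wakeup lag.
    public static long measureOnce() throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        final long[] wokeAt = new long[1];
        Thread waiter = new Thread(() -> {
            try {
                sem.acquire();                 // parks until released
                wokeAt[0] = System.nanoTime(); // timestamp right after unpark
            } catch (InterruptedException ignored) {}
        });
        waiter.start();
        Thread.sleep(100);                     // give the waiter time to actually park
        long releasedAt = System.nanoTime();
        sem.release();
        waiter.join();
        return wokeAt[0] - releasedAt;
    }
}
```

A serious measurement would repeat this many times under JMH and report a distribution, since a single handoff is dominated by scheduler noise.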

-Aleksey.

tech...@andypiper.com

unread,
Apr 14, 2014, 5:42:30 AM4/14/14
to mechanica...@googlegroups.com
Wow, what a firestorm! I only asked the question on Twitter because in the dim and distant past (20+ years ago) I did research into divide-and-conquer as a general-purpose paradigm for parallel computing. I find it amusing that there is indeed nothing new under the sun in computer science. I was even at a conference last week where one of the speakers suggested that functional languages were going to take over the world because they map well to parallel architectures - no **** sherlock, 20 years ago they were saying the same things and instead we got .... Java.

I don't have any recent experimental evidence, so I can only offer my opinions from the past:

- Anything that encourages micro-parallelism is doomed to failure because coordination/communication costs will kill you.
- Explicit parallelism is easy to implement, but no-one really wants explicit parallelism (unless that is the sole purpose of your environment e.g. Hadoop) because it's too distracting in the context of solving problems.
- Syntactic sugar gave us operator overloading in C++. I think it's too early to tell whether lambdas and streams will lead to better software but proof of less typing is a pretty meaningless metric IMO.

andy

Martin Thompson

unread,
Apr 14, 2014, 7:55:04 AM4/14/14
to mechanica...@googlegroups.com
I'm sorry for my part in this thread becoming a bit of a fire storm. That was not my intention. 

I think parallel streams and FJ have value. I want to have a healthy discussion about how best to use them, what to watch out for, and what the possible alternatives are. Thank you, Remi, for the useful lead on how we might get the FJP to collaborate with other pools configured on a system. We need further discussion on what granularity works for FJ and how it can be debugged. So many of the stream examples have tasks much smaller than FJ provides value for. Doug has said himself that unless a task is more than a few hundred cycles of work then divide-and-conquer does not pay for itself. From personal experience I've also found that the scheduling cost is greatly increased in virtualised OS environments, where this threshold can become thousands of cycles.
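The granularity point can be made concrete with a standard RecursiveTask that stops dividing below a sequential cutoff. The THRESHOLD value here is an illustrative guess, not a recommendation; per the discussion above it would need to be measured per workload and environment.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ThresholdSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // leaf size: keep leaves well above a few hundred cycles

    final long[] data;
    final int lo, hi;

    ThresholdSum(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // below the cutoff: plain sequential loop
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ThresholdSum left = new ThresholdSum(data, lo, mid);
        left.fork();                          // schedule the left half for stealing
        long right = new ThresholdSum(data, mid, hi).compute(); // do the right half inline
        return right + left.join();
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ThresholdSum(data, 0, data.length));
    }
}
```

If the threshold is set too low, the fork/join bookkeeping swamps the work; too high, and the pool cannot balance load. That tension is exactly the debuggability question raised above.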

Ultimately, we are trying to work out how to improve what our software can deliver on modern hardware. I'd like to leave the term performance out of it because it is too much of an umbrella. We should focus more on response time, throughput, and scalability because they can have units applied to them and be measured.

Andy, I think your point on micro-parallelism is well made but misses at least one possibility. SIMD can provide micro-parallelism without the coherence costs. Many of the examples where divide-and-conquer does well are filtering, sorting, reducing, etc. SIMD can add a lot of value here without the scheduling issues of going across threads. Each generation of our CPUs can do more work per cycle using vectorization instructions. Even languages like C#, Dart, and JavaScript are getting implicit and explicit support.
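Java has no explicit SIMD API (as of this thread), but HotSpot's superword optimizer can often auto-vectorize simple, branch-free loops over primitive arrays. A sketch of such a loop shape (the class name is made up; whether it actually vectorizes depends on JVM version and CPU):

```java
public class Saxpy {
    // y[i] = a * x[i] + y[i]: a counted loop over primitive arrays with no
    // branches or calls in the body, the shape auto-vectorizers like best.
    public static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

This is micro-parallelism within a single thread: no scheduling, no coherence traffic, just wider instructions per cycle.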
 
SIMD and a number of other features missing from core Java raise an interesting question about innovation. We have to accept that the best way for innovation to occur is for the wider community to be involved. If the only primitives added are those that support the mindset of the core committers, then innovation is stifled and the environment for innovation is limited. I believe this is why we don't have alternatives to FJP for scheduling Akka. However, that is just my view.

If we had the likes of x86 PAUSE, CPUID, core-local memory, stack allocation, and thread affinity then other concurrent approaches could flourish. It is great that we are now getting XADD and XCHG, which have been sorely missed for a long time. This thinking can also be applied to memory layout. We have to do a lot of complex coding to get a 10X throughput increase by going parallel. Yet with control of memory layout it is easy to get a greater than 10X throughput increase for core data structures like maps, lists, and trees without complicating the programming model by introducing parallel or concurrent concepts. Our CPUs spend most of their time idle due to not being fed. Cache misses and coherence costs are the main cause of this.
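To illustrate the memory-layout point (this is the usual structure-of-arrays trick, not any specific library): the same logical "list of points" laid out as contiguous primitive arrays scans sequentially and keeps the prefetcher busy, where a `List<Point>` walk chases one pointer per element. Names are made up.

```java
public class PointsSoA {
    // Structure-of-arrays: each field lives in its own contiguous array,
    // so a scan over one field streams through memory cache line by line.
    public final double[] xs, ys;

    public PointsSoA(int n) {
        xs = new double[n];
        ys = new double[n];
    }

    // Compare with iterating a List<Point>: that walk dereferences a
    // separate heap object per element, missing the cache at every hop.
    public double sumX() {
        double s = 0;
        for (double x : xs) s += x;
        return s;
    }
}
```

No threads, no pools: the speedup comes purely from keeping the CPU fed.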

When I say I'd like more open minded thinking and diversity from the core Java team what I mean is a consideration for how innovation is fostered in the greater community. Being in the core and having influence on what key building blocks are available to support innovation is a big responsibility that must start with listening and showing empathy to a passionate community.

Best Regards,
Martin...



Aleksey Shipilev

unread,
Apr 14, 2014, 8:32:40 AM4/14/14
to mechanica...@googlegroups.com
On 04/14/2014 03:55 PM, Martin Thompson wrote:
> I'm sorry for my part in this thread becoming a bit of a fire storm.
> That was not my intention.

Ditto for me.

> I think parallel streams and FJ have value. I want to have a healthy
> discussion about how to best use them, what to watch out for, and what
> are the possible alternatives. Thank you Remi for the useful lead on how
> we consider getting the FJP to collaborate with other pools configured
> on a system. We need further discussion on what granularity works for FJ
> and how it can be debugged.

This was answered a number of times by us, see e.g. Paul reciting a
simple C/P/N/Q performance model:
http://pppj2013.dhbw.de/fileadmin/user/public/Dokumente/PPPJ13/Lambdas_in_Java_8_-_Paul_Sandoz.pdf.


* C being the number of clients submitting to the pool: more clients improve
performance a bit because FJP then always has work to do (ahoy submitter
queues and balancing magic in FJP!), so we concentrate on the
hardest case of C=1

* Performance of FJP-backed Streams largely depends on N*Q product,
where N is number of elements in the Stream, and Q is relative cost of
operation through the pipeline (this seems obvious in hindsight, but we
started off verifying without assuming that prior). We found that
break-even for parallel-vs-sequential streams with C=1 are somewhere
around N*Q = [200; 400] us; and most of the performance work within FJP
and Streams was focused around bringing the break-even lower. The mere
fact the library and runtime can not estimate Q reliably marks the
decision to push for explicit parallelism instead of implicit, no matter
how low we can make the break-even.

* P being the number of available CPUs: performance improves
linearly with large N*Q or large enough C (checked up to 128 cores --
ahoy work-stealing and forked dissemination of local queues!), and
sub-linearly with close-to-break-even N*Q and lower C (ahoy pool wakeup
mechanics and execute-in-submitter).
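The N*Q rule of thumb above can be sketched as an explicit decision in user code. BREAK_EVEN_NANOS and the caller-supplied Q estimate below are assumptions the caller would have to measure for their own workload; they are not constants from the Streams library.

```java
import java.util.List;
import java.util.function.Predicate;

public class ParallelDecision {
    // Midpoint of the [200; 400] us break-even range quoted above;
    // an assumption to be re-measured per machine and workload.
    static final long BREAK_EVEN_NANOS = 300_000;

    public static <T> long countMatching(List<T> data, Predicate<T> p, long estimatedQNanos) {
        long nq = (long) data.size() * estimatedQNanos; // total work estimate: N * Q
        return nq >= BREAK_EVEN_NANOS
                ? data.parallelStream().filter(p).count()  // enough work to amortize FJP overhead
                : data.stream().filter(p).count();         // stay sequential below break-even
    }
}
```

The awkwardness of having to guess Q by hand is exactly why the decision was left explicit: the runtime cannot estimate it for you.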

Knowing those gory details really makes me think that the idea of
replacing FJP in this task because of the "disadvantages", and without
having a clear alternative, is a manifestation of the local Dunning-Kruger
effect.

> When I say I'd like more open minded thinking and diversity from the
> core Java team what I mean is a consideration for how innovation is
> fostered in the greater community. Being in the core and having
> influence on what key building blocks are available to support
> innovation is a big responsibility that must start with listening and
> showing empathy to a passionate community.

Yes, modulo two things:
a) It would be bad to stay in the echo chamber when all the
high-performance people in this room tell you some feature is badly
needed. This may very well distort our own reality into believing the
other 6+ million Java developers and users are non-existent. Breaking the
programming model is something you should be very cautious about with a
user base like that.
b) There should be no empathy to the trolls which list disadvantages
without listing the advantages, and vice versa. I get furious when that
happens (as you can tell).

-Aleksey.

Nitsan Wakart

unread,
Apr 14, 2014, 9:07:45 AM4/14/14
to mechanica...@googlegroups.com
I'm so very late to the party, thank you all for an entertaining and enlightening display!
Ignoring the pent up aggression and friendly slap exchanges, my takeaways/comments:
- FJ serves a use case; I agree with Shipilev that being used is a clear indication of usefulness. The rate of conversion of other software (JDK internals apart) from the old thread pools/executors to FJ will show if this is a trend. It's not true to say there's no competition; I've seen many in-house variations on executors/thread pools, and maybe FJ can remove the need for those.
- FJ does not serve all use cases, as pointed out by all. It would be good if people highlighted the failed use cases (mystery use case from LMAX, very small (<X00ns?) tasks from Doug, anyone else?). As always, someone will not get what they wanted, and they are better off knowing upfront.
- Shipilev is asking for data, I agree with the sentiment. To enable contribution and comparison I think Oracle should open the benchmark suite used to validate the framework. If these are already open, where are they?
- There is a wider question here about the Streams construct as opposed to the FJ executor. The idea of Streams as a means of submitting work to be made parallel by some unknown underlying generic executor does not seem right to me; I have nothing to back this dislike apart from similar experiences with thread worker pools. I would be very curious to see how the benefits of this approach have been quantified. Maybe I should RTFM...
- Martin/Kirk are saying a pipeline model is a better model than the Streams model, a statement backed by their consulting experience (is it too niche?). I would like some concrete examples which would enable people to see if the shoe fits their use case.
- On lambdas I agree with Mr. Piper, time will tell. Lambdas are a new feature to be used and abused.
- The call from Shipilev along the lines of "If you don't like it, build an alternative" is slightly unfair to the spirit of community feedback and process. The choice to implement FJ/Streams should be justified compared to other available approaches at the time (no point in blaming people for not seeing the future). If on the other hand community feedback is not important, then this point is moot. If an alternative has to be built is that not a failure?

William Louth

unread,
Apr 14, 2014, 12:05:00 PM4/14/14
to mechanica...@googlegroups.com
Hi Remi,

[not intending to stoke the fire anymore than it has already]

I hold the view that thread pool management (sizing, partitioning, ...) is not the problem or the solution here, though I recognize that rampant thread creation incurs some capacity cost and, to a lesser extent, compute cost (when idle). I don't believe that developers or operations will ever correctly configure or size such things, so it is far better to have them appear unlimited; that is not to say they should run (execute) in an unlimited fashion. The problem with thread pools is that they are pretty much disconnected from the activities (flows/paths/requests/...) that they must perform. It is only when they perform some work, after they have left the thread pool, that we can discern what they are tasked with doing and what the expected duration of that reservation and leasing of the "working" capacity is.

We use thread pools to control the ability of the system to do work (light or heavy), when what we should be doing is placing adaptive control mechanisms, valves, within the code itself, which, given the much needed context, allow the runtime to automatically police and shape the call traffic flow through the system for better throughput, response time, stability, ... Yes, the system measures what it manages, as well as how effective it is in achieving this. Reservation and resource management policies should be code/context specific. With service classification (found in network management) we can coordinate concurrent thread execution across different method/code blocks by associating them with services that are then linked to resource pools; pools that are virtual but represent tangible ways for us (man or machine) to influence the overall dynamics of the system.

In short: don't restrict threads or size pools; instead, police and shape how and when they perform work, based on service classification (code+ctx->classification) and resource management policies, largely built on reservation & leasing, that can autonomically adapt capacities and prioritizations.

What the Java platform needs is a resource management API that allows it to adapt in a way that appears intelligent, in the sense that it relates to the context of the tasks being performed. A QoS for thread calls/flows.
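This is not William's design, but one tiny reading of the "valve" idea: a per-service-class permit pool that shapes how many threads may execute a code block concurrently, independent of how any thread pool is sized. Class names and limits are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class ServiceValve {
    // One valve per service class; an adaptive version could resize permits
    // from observed response times instead of using fixed numbers.
    private final ConcurrentHashMap<String, Semaphore> classes = new ConcurrentHashMap<>();

    public void define(String serviceClass, int permits) {
        classes.put(serviceClass, new Semaphore(permits, true));
    }

    // Any thread, from any pool, may call; the valve at the code site,
    // not the pool size, limits concurrency for this class of work.
    public <T> T call(String serviceClass, Supplier<T> work) throws InterruptedException {
        Semaphore valve = classes.get(serviceClass); // assumes the class was defined
        valve.acquire();
        try {
            return work.get();
        } finally {
            valve.release();
        }
    }
}
```

The point of the sketch is only the shape: control lives next to the work and its classification, so threads can appear unlimited while execution stays shaped.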


This approach I'm proposing has been put to various team members in the past, but with no takers, though I do occasionally hear of someone in Oracle mentioning QoS, sadly in a very different and, in my view, incorrect context.

Rüdiger Möller

unread,
Apr 14, 2014, 12:54:56 PM4/14/14
to mechanica...@googlegroups.com
I'd agree with Martin here. Most of the time pipelining just feels more natural to apply when parallelizing work. FJ just does not fit many use cases, though one may "make it fit". Anyway, maybe my brain just has not adapted yet :-)

Additionally, pipelining can be implemented allocation-free, and tests I have done indicate the break-even for parallel computation comes much later when using FJ, due to the inherent overhead of that approach (I have not done exact benchmarks isolating this fact).

e.g. Akka using FJ performs worse than Disruptor/thread-pool-based actors when reducing compute-job size

(scroll to bottom for charts)

Of course there are other factors contributing in this test; however, FJ vs. Disruptor/thread pools is probably one of them.

Jin Mingjian

unread,
Apr 15, 2014, 12:39:06 AM4/15/14
to mechanica...@googlegroups.com
Take it easy, and thanks, Aleksey :) Most of your comments are right, although I may not be too wrong either :)

The only thing is: I do not deny Doug's "engineering excellence"; I question how many other people in the whole community can reach it. I know you have gone deep into FJ. I have reviewed all your presentations (including the Russian version), and they are great for understanding FJ. I think you may have the ability to fix bugs in FJ too. Assume Oracle has six or ten people who can do this. I still personally can *not* believe that "engineering excellence" which only six or ten people can handle is right for a top-popular language runtime.

Yes, I have shown no code here, because even if I did, someone could say it is trivial or lacking in engineering considerations, and that would be true. So I just share some ideas of mine that I have not found in this discussion; I suppose this is still OK :) As for code, I think Rüdiger's work is similar to my experiment. But again, Akka's FJ usage may not be the best usage of FJ; this is where we need to be educated by you guys :)

Finally, my recent technical thought is that if it is hard to implement a common async pool with nice trade-offs, the alternative is to set up mechanisms that help users design their own dedicated pools. Hiding details does not always work for all developers or all aspects; otherwise, why the mechanical sympathy here? :)

best regards,
Jin...






Aleksey Shipilev

unread,
Apr 15, 2014, 4:05:58 AM4/15/14
to mechanica...@googlegroups.com
On 04/15/2014 08:39 AM, Jin Mingjian wrote:
> However, I still personally can *not* believe that it is an
> "engineering excellence" which only six or ten people can handle for
> a top-popular language runtime.

Ya. How many people can maintain, say, RCU in the Linux kernel? The world is
complicated, and you have to face the world's complexity at
some level. It is not the engineering fault of those six-ten people that
the world is busy building off Candy Crush clones instead of
participating in hardcore systems programming. The systems programmer layer
is very, very thin almost everywhere (the recent Heartbleed fiasco is
just another reminder of that), or, as Remi says, we are "short of
skilled people willing to help".

To sprinkle some more hardcoredness into this, here's the big picture
from DL himself:
http://pldi12.cs.purdue.edu/sites/default/files/slides_pldi12-dlea.pdf
(the thing I always forget to point out in FJP discussions, is at Doug's
slide #12).

-Aleksey.

tech...@andypiper.com

unread,
Apr 15, 2014, 5:47:30 AM4/15/14
to mechanica...@googlegroups.com

On Monday, April 14, 2014 12:55:04 PM UTC+1, Martin Thompson wrote:
Andy I think your point of micro-parallelism is well made but missing at least one possibility. SIMD can provide for micro-parallelism without the coherence costs. Many of the examples of divide-and-conquer doing well are filtering, sorting, reducing, etc. SIMD can add a lot of value here without the scheduling issues of going across threads. Each generation of our CPUs can do more work per cycle using vecorization instructions. Even languages like C#, Dart, and JavaScript are getting implicit and explicit support.


Yes agreed. I was going to add "... unless supported in hardware" but that didn't seem controversial enough ;)

My concern with fine-grained explicit parallelism is that in the large (i.e. not benchmarking but actually solving problems) people will end up using it too much and end up with a mess that actually goes slower. It's self-evident that decoupled coarse-grained chunks of work will perform much better than highly coupled smaller pieces - that's why MapReduce works so well. My concern is that unless you have that mindset you will fail. Time will tell.

andy


Rüdiger Möller

unread,
Apr 15, 2014, 2:32:30 PM4/15/14
to mechanica...@googlegroups.com
It's sad this discussion has somehow gone up in flames, as multithreading is definitely a major area of trouble in Java currently.
I just made a quick summary of the concurrency-related issues we had in a 7-15 developer team on a soft-realtime application (clustered in-memory data grid + GUI front end):
  • 95% of threads created are not to improve throughput but to avoid blocking (e.g. IO, don't block messaging receiver thread, avoid blocking the event thread in a GUI app, ..).
  • developers love synchronous code, async callback patterns are avoided (one big reason is the syntax clutter created when using anonymous classes [even worse if nested]). Callers of such synchronous code in turn create threads to "asynchronify" sync methods/functionality. Sometimes this cascades.
  • The <5% of cases where multithreading was introduced for performance reasons (message de/encoding, querying in-memory datasets) were the least problematic. We use pipelining and ordinary fixed thread pools when applying sharding/striping patterns. Usually the concurrent design in these cases is encapsulated, so there is rarely trouble here.
Major issues:
  • It's very hard, looking at a piece of source, to figure out which threads (if any) may enter it. It requires global awareness of the thread architecture and of the different places a method may get called.
  • Side effects introducing concurrency: some developer decides to do something in parallel (e.g. throws in '.parallel().') in code which used to be single-threaded. Since his call may indirectly walk half the framework, all of a sudden a lot of single-threaded code is multithreaded.
Imo, we need patterns to control and contain concurrency
Java 8's improvements imo address a non-existent use case: speeding up vanilla business code by the (uncontrolled) introduction of concurrency. This won't be practical in large real-world apps, as you completely lose control over concurrent execution and concurrent data access, which will blow up your application for sure.
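The "side effects introducing concurrency" point above is easy to demonstrate: one .parallel() call upstream silently moves the lambda bodies onto common-pool worker threads, with nothing at the call site hinting at it. The class name here is made up.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.stream.IntStream;

public class HiddenParallelism {
    // Records which threads execute the lambda body. With .parallel(), the
    // body runs on ForkJoinPool.commonPool() workers plus the calling thread;
    // any thread-confined state touched inside would now be shared.
    public static Set<String> threadsUsed() {
        ConcurrentSkipListSet<String> names = new ConcurrentSkipListSet<>();
        IntStream.range(0, 10_000)
                 .parallel() // one token flips the whole pipeline to multithreaded
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }
}
```

On a multi-core box the returned set typically contains several ForkJoinPool worker names; the single-threaded caller never opted into any of them explicitly.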

If I want to multithread for performance, I am better off using a pipelining scheme, as the intrinsic overhead of lambda/FJ is high compared to a pipelining (or well-implemented actor) approach. The reality is that tasks of some 100 cycles (referred to as 'micro-parallelism' above) are the default. One rarely has 'big' parallelizable computing tasks in today's business applications.

So boiled down there are two very distinct requirements:
  • organize concurrent execution of business code in a safe, understandable and maintainable way 
  • provide utilities for high performance parallel computational tasks
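As one deliberately tiny reading of the pipelining scheme mentioned above: stages connected by bounded queues, each stage single-threaded so the code inside a stage needs no locks. Stage functions, queue sizes, and names are illustrative only, not a claim about any particular pipelining library.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoStagePipeline {
    // Stage 1 (caller thread) transforms items and hands them off;
    // Stage 2 (its own thread) consumes and aggregates, in order.
    public static long run(int items) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024); // bounded: built-in backpressure
        long[] result = new long[1];

        Thread stage2 = new Thread(() -> {
            try {
                long sum = 0;
                for (int i = 0; i < items; i++) sum += q.take(); // single consumer, no locks needed
                result[0] = sum;
            } catch (InterruptedException ignored) {}
        });
        stage2.start();

        for (int i = 1; i <= items; i++) q.put(i * 2); // stage 1: transform, then hand off
        stage2.join();
        return result[0];
    }
}
```

Each stage owns its state outright, which is what makes pipelines easy to reason about; an allocation-free variant would replace the boxed queue with a ring buffer of primitives, Disruptor-style.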

Aleksey Shipilev

unread,
Apr 15, 2014, 2:37:07 PM4/15/14
to mechanica...@googlegroups.com
On 04/15/2014 10:32 PM, Rüdiger Möller wrote:
> If I want to multithread for performance, I am better off using a
> pipelining scheme, as the intrinsic overhead of lambda/FJ is high
> compared to a pipelining (or well implemented actor) approach.

I would really like to see a verifiable study backing this claim.

-Aleksey.

Rüdiger Möller

unread,
Apr 15, 2014, 2:51:20 PM4/15/14
to mechanica...@googlegroups.com
You are right, but there is some evidence. FJ is at a disadvantage because it requires object allocation and queuing. Someone should write a benchmark comparing e.g. Disruptor and FJ in computing Pi :-). I have already done the Disruptor bench (actually MikeB did the good one), so only a pure fork-join one is needed. I'd estimate the base overhead of FJ is at least 2 times higher, but that's a pure guess.
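For what it's worth, the fork-join half of such a Pi benchmark could be sketched roughly like this, with each leaf summing a slice of the Leibniz series. SLICE is an arbitrary cutoff, and a real comparison would still need JMH (as Aleksey's actual benchmark below does).

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class PiForkJoin extends RecursiveTask<Double> {
    static final int SLICE = 100_000; // leaf size: arbitrary illustrative cutoff

    final int from, to; // term indices [from, to)

    PiForkJoin(int from, int to) { this.from = from; this.to = to; }

    @Override protected Double compute() {
        if (to - from <= SLICE) {
            // Leibniz: pi = 4 - 4/3 + 4/5 - 4/7 + ...
            double s = 0;
            for (int i = from; i < to; i++) s += (i % 2 == 0 ? 4.0 : -4.0) / (2 * i + 1);
            return s;
        }
        int mid = (from + to) >>> 1;
        PiForkJoin left = new PiForkJoin(from, mid);
        left.fork();                                   // let the pool steal the left half
        return new PiForkJoin(mid, to).compute() + left.join();
    }

    public static double pi(int terms) {
        return ForkJoinPool.commonPool().invoke(new PiForkJoin(0, terms));
    }
}
```

The equivalent Disruptor version would push the same slices through a ring buffer instead of forking tasks, which is exactly the overhead comparison being debated.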

Aleksey Shipilev

unread,
Apr 15, 2014, 3:06:03 PM4/15/14
to mechanica...@googlegroups.com
Granted, I have a very high bar for what constitutes "evidence", and so
far this does not sound verifiable enough to me, especially when
someone uses the words "estimate" and "pure guess" in the same sentence
;) Publish the benchmark code, maybe?

-Aleksey.

Aleksey Shipilev

unread,
Apr 15, 2014, 3:15:04 PM4/15/14
to mechanical-sympathy
I remember seeing something along those lines already in this thread; you probably mean this experiment?


-Aleksey.

Rüdiger Möller

unread,
Apr 15, 2014, 3:18:11 PM4/15/14
to mechanica...@googlegroups.com
I just made an estimate so I can brag about it later once I've written a benchmark proving it (in case things go the other way, I'll stick to some random J2EE forum for a while ;-) ).

The benchmark code for the Disruptor bench is available at
https://github.com/RuedigerMoeller/abstractor/tree/master/src/test/java/de/ruedigermoeller/abstraktor/sample
(the full project is probably somewhat messed up; the attic contains the Akka bench).

The snippets listed at the bottom of http://java-is-the-new-c.blogspot.de/2014/01/comparision-of-different-concurrency.html should also be mostly compilable standalone.

Aleksey Shipilev

unread,
Apr 15, 2014, 4:46:52 PM4/15/14
to mechanica...@googlegroups.com
I took the "Disruptor (optimized)" as the base for my benchmark, because
it appears simpler. Can you eyeball if I screwed something up while
porting that benchmark under JMH?

Here's the project:
https://github.com/shipilev/disrupting-fjp

Here's the Disruptor benchmark:
https://github.com/shipilev/disrupting-fjp/blob/master/src/main/java/net/shipilev/Disruptor.java

Preliminary runs on my 2x2 i5 Sandy Bridge, Linux x86_64 (as I don't
want to waste lab time on broken benchmarks), JDK 8 GA... Fork/Join
appears faster:

Benchmark                       Mean      Mean error  Units
n.s.Disruptor.run               780.561   10.392      ms/op
n.s.ForkJoin.run                652.316   35.723      ms/op
n.s.ForkJoinRecursive.run       566.348    4.551      ms/op
n.s.ForkJoinRecursiveDeep.run   597.236    3.289      ms/op

I checked that the number of calculatePi calls is the same, so these
benchmarks process the same amount of work. But I still wonder if
something is off in those benchmarks. For example, the Disruptor runs
really high at sys%, which makes me uneasy about its result.

-Aleksey.

Michael Barker

unread,
Apr 15, 2014, 4:55:04 PM4/15/14
to mechanica...@googlegroups.com
I get the following exception when I try to run the test (Disruptor pass runs okay):

This is with JDK 1.7.0_40.

java.lang.ClassCastException: java.lang.Thread cannot be cast to java.util.concurrent.ForkJoinWorkerThread
at java.util.concurrent.ForkJoinTask.fork(ForkJoinTask.java:622)
at java.util.concurrent.ForkJoinTask.invokeAll(ForkJoinTask.java:684)
at net.shipilev.ForkJoinRecursiveDeep$PiForkJoinTask.compute(ForkJoinRecursiveDeep.java:52)
at net.shipilev.ForkJoinRecursiveDeep$PiForkJoinTask.compute(ForkJoinRecursiveDeep.java:30)
at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:93)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:377)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:654)
at net.shipilev.ForkJoinRecursiveDeep.run(ForkJoinRecursiveDeep.java:60)
at net.shipilev.generated.ForkJoinRecursiveDeep.run_AverageTime_measurementLoop(ForkJoinRecursiveDeep.java:155)
at net.shipilev.generated.ForkJoinRecursiveDeep.run_AverageTime(ForkJoinRecursiveDeep.java:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.openjdk.jmh.runner.LoopMicroBenchmarkHandler$BenchmarkTask.invokeBenchmark(LoopMicroBenchmarkHandler.java:229)
at org.openjdk.jmh.runner.LoopMicroBenchmarkHandler$BenchmarkTask.call(LoopMicroBenchmarkHandler.java:197)
at org.openjdk.jmh.runner.LoopMicroBenchmarkHandler$BenchmarkTask.call(LoopMicroBenchmarkHandler.java:182)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)




Vitaly Davidovich

unread,
Apr 15, 2014, 4:58:17 PM4/15/14
to mechanica...@googlegroups.com

Just a guess, but the high sys% may be due to the use of the sleeping wait strategy (I realize that's what the original benchmark used).  Out of curiosity, can you try the busy-spin one?

Sent from my phone


Aleksey Shipilev

unread,
Apr 15, 2014, 5:02:20 PM4/15/14
to mechanica...@googlegroups.com
Sorry about that, I'm using the JDK 8 "common pool" APIs to invoke the
F/J task (the same way Streams invoke it). Run with JDK 8 ;)

-Aleksey.

Rüdiger Möller

unread,
Apr 15, 2014, 5:33:26 PM4/15/14
to mechanica...@googlegroups.com
The original bench requires THREADS+1 cores, as there are N workers plus one result reclaimer, so with 4 cores you might get artifacts (just try running with threads = 3). You can see such an artifact in the blog benchmarks when all available cores are used.
I ran them on a dual-socket AMD Opteron (16c 8t) and on a dual-socket Xeon (6th gen, 12c, HT off), with Oracle JDK 1.7.
Since I have already left my nerd-home in favor of my gf's home, I have no access to real machines right now :).

Some questions, as I am not familiar with JMH (I know I should be ..):

- Does the measurement include minor GC, and does it run long enough for that to actually show up in the results?
- What does @Fork(5) mean?

I'll run this tomorrow on the 12c Intel box.
At first glance the code seems OK.
Will set up the project and run it on my i7 laptop ...

Michael Barker

unread,
Apr 15, 2014, 5:33:37 PM4/15/14
to mechanica...@googlegroups.com
So I had a small play around with the benchmark and got some interesting results.  My first observation is that the Disruptor implementation creates too many threads: the Disruptor requires an additional thread for the result calculation, so at most PiEventProcessor procs[] = new PiEventProcessor[Shared.THREADS - 1], which improved performance.  As an experiment I reduced the number of threads further; with Shared.THREADS - 2 I got Disruptor results comparable to the fast F/J runs, but I got the best return when I set PiEventProcessor procs[] = new PiEventProcessor[1].  It's probably worth double-checking that the code is doing the right thing, and also worth having a single-threaded implementation of the algorithm as a baseline.

Benchmark                         Mode   Samples         Mean   Mean error    Units
n.s.Disruptor.run                 avgt        25       51.975        0.230    ms/op
n.s.ForkJoin.run                  avgt        25      110.370        3.602    ms/op
n.s.ForkJoinRecursive.run         avgt        25      101.573        1.190    ms/op
n.s.ForkJoinRecursiveDeep.run     avgt        25      105.736        0.955    ms/op

Mike



Rüdiger Möller

unread,
Apr 15, 2014, 5:37:35 PM4/15/14
to mechanica...@googlegroups.com
The original version improved with each core added, so there must be some error in the port.

Aleksey Shipilev

unread,
Apr 15, 2014, 5:39:29 PM4/15/14
to mechanica...@googlegroups.com
On 04/16/2014 01:33 AM, Rüdiger Möller wrote:
> - does measurement include minor GC, and does it run long enough to
> actually get this into results ?

Haven't checked.

> - what does @Fork(5) mean ?

It means JMH does 5 consecutive forked runs (fresh JVMs) for a single
test and aggregates their results. This helps to quantify run-to-run
variance.
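Conceptually, the aggregation across forks amounts to something like the sketch below (the per-fork scores are made-up numbers, and real JMH additionally scales the reported error by a confidence-interval multiplier):

```java
import java.util.Arrays;

public class ForkAggregate {
    public static void main(String[] args) {
        // @Fork(5), conceptually: five fresh JVM runs produce one score
        // each (hypothetical ms/op values), aggregated into mean + spread.
        double[] forkScores = {651.2, 649.8, 655.1, 648.9, 652.6};

        double mean = Arrays.stream(forkScores).average().getAsDouble();
        double variance = Arrays.stream(forkScores)
                                .map(s -> (s - mean) * (s - mean))
                                .sum() / (forkScores.length - 1);
        double stdev = Math.sqrt(variance);

        System.out.printf("mean = %.3f ms/op, stdev = %.3f ms/op%n", mean, stdev);
    }
}
```

Running each fork in a fresh JVM de-correlates JIT and GC luck between runs, which is exactly what makes the spread meaningful.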

-Aleksey.

Aleksey Shipilev

unread,
Apr 15, 2014, 5:49:18 PM4/15/14
to mechanica...@googlegroups.com
On 04/16/2014 01:33 AM, Michael Barker wrote:
> So I have a small play around with the benchmark and I get some
> interesting results. My first observation is that the Disruptor
> implementation is creating too many threads. The Disruptor requires an
> additional thread for the result calculation. So at most
> PiEventProcessor procs[] = new PiEventProcessor[Shared.THREADS - 1],
> which improved performance. As an experiment I started reducing the
> number of thread further and with Shared.THREADS - 2 I got result with
> the Disruptor that were comparable to the fast F/J runs, however I got
> the best return when I set PiEventProcessor procs[] = new
> PiEventProcessor[1].

I can confirm the improvement with a quick run:

Benchmark                         Mean   Mean error    Units
n.s.Disruptor.run              329.025      101.422    ms/op
n.s.ForkJoin.run               725.392      165.252    ms/op
n.s.ForkJoinRecursive.run      587.997       66.454    ms/op
n.s.ForkJoinRecursiveDeep.run  650.515      122.800    ms/op

...however, I think the benchmark is fragile if you change *just* the
number of PiEventProcessors; with fewer processors, some of the
partitions are not covered, and you need to fix that too, say, for a
single processor:

- partitionId = (partitionId == (Shared.THREADS - 1)) ? 0 : partitionId + 1;
+ partitionId = 0;

With this change:

Benchmark                         Mean   Mean error    Units
n.s.Disruptor.run             1188.144       79.235    ms/op
n.s.ForkJoin.run               723.150      265.222    ms/op
n.s.ForkJoinRecursive.run      579.816       11.223    ms/op
n.s.ForkJoinRecursiveDeep.run  611.394       24.488    ms/op

-Aleksey.

Rüdiger Möller

unread,
Apr 15, 2014, 6:14:56 PM4/15/14
to mechanica...@googlegroups.com
I tried different wait strategies in the original benchmarks, but it did not make a difference. That might be different here, as we don't have enough cores when testing on laptops ..


Rüdiger Möller

unread,
Apr 15, 2014, 6:15:36 PM4/15/14
to mechanica...@googlegroups.com
Runtime.availableProcessors() is misleading on hyper-threaded CPUs: it returns 8 for my 4-core CPU.
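A one-liner makes the point (the output is hardware-dependent, so there is no fixed value to expect):

```java
public class CpuCount {
    public static void main(String[] args) {
        // availableProcessors() reports logical processors (hardware
        // threads), not physical cores: a 4-core CPU with hyper-threading
        // enabled reports 8 here.
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("logical processors: " + logical);
    }
}
```

This is also what ForkJoinPool.commonPool() sizes itself from (parallelism is that value minus one by default), so pool sizing inherits the same hyper-threading bias.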

Rüdiger Möller

unread,
Apr 15, 2014, 7:33:29 PM4/15/14
to
double post, see below. Google groups worst webapp ever.

Rüdiger Möller

unread,
Apr 15, 2014, 9:53:49 PM4/15/14
to
Update: I missed the ForkJoinRecursiveDeep implementation, which actually splits down to a single job.

The point of the benchmark is to measure the speed of processing 1 million individual, independent jobs of short duration. I simulate the processing by computing a tiny slice of Pi as a placeholder for business logic.

A correct implementation would have to submit 1 million jobs from the main loop (like incoming events/requests). In an event-sourced system, e.g. when parallelizing decoding, you can't do recursive divide-and-conquer; you have to queue incoming jobs/events somehow.
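A sketch of that one-task-per-event submission pattern on the JDK 8 common pool. The `calculatePi` here is a hypothetical stand-in for the benchmark's `Shared.calculatePi` (a 100-iteration slice of the Leibniz series):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class ManySmallJobs {
    // Stand-in for Shared.calculatePi: one 100-iteration slice
    // of the Leibniz series 4 * sum((-1)^i / (2i + 1)).
    static double calculatePi(int slice) {
        double acc = 0;
        for (int i = slice * 100; i < (slice + 1) * 100; i++) {
            acc += (4.0 * (1 - (i % 2) * 2)) / (2 * i + 1);
        }
        return acc;
    }

    public static void main(String[] args) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        int slices = 100_000;

        // Submit every slice as an individual task, the way an event
        // loop would hand off incoming requests, then reduce the results.
        List<ForkJoinTask<Double>> tasks = new ArrayList<>();
        for (int s = 0; s < slices; s++) {
            final int slice = s;
            Callable<Double> job = () -> calculatePi(slice);
            tasks.add(pool.submit(job));
        }

        double acc = 0;
        for (ForkJoinTask<Double> t : tasks) {
            acc += t.join();
        }
        System.out.println(acc); // converges toward Math.PI
    }
}
```

Note the cost being debated in this thread: every event allocates a task object, unlike a ring buffer that recycles its slots.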

This kind of micro-parallelization is very common in real applications, where a process receives hundreds of thousands, if not millions, of events or requests per second with very low processing time for each individual request/event. It's not about parallelizing some huge task, so the Pi example might create a wrong impression; the Pi computation is used in the bench only to bring in some kind of processing "work".

However, this use case does not match FJ, and this might be the reason people are "struggling" with its application, as 95% of non-GUI apps are servers processing incoming events/requests.

Anyway, I am very impressed, as the benchmark shows that the scheduling overhead of F/J is much lower than I initially assumed (which probably means I have to surf J2EE forums for some weeks ;).

Another thing: the Disruptor is faster with 1.7 on my machine ..

Michael Barker

unread,
Apr 15, 2014, 9:06:50 PM4/15/14
to mechanica...@googlegroups.com
I can cheat slightly and get the Disruptor's performance to within 10% of the best F/J implementation.  I had to reduce the number of threads to Shared.THREADS - 1, because we need to reserve a core for the PiResultCalculator, and increase the size of the ring buffer to size = Integer.highestOneBit(Shared.SLICES) << 1 so that the publishing thread can queue all of the work without waiting for the other threads to complete some of it; this lets the maximum number of threads crunch away on the data.  Not sure this adds much to the original discussion though...

Results:

Benchmark                         Mode   Samples         Mean   Mean error    Units
n.s.Disruptor.run                 avgt        25      109.137        2.653    ms/op
n.s.ForkJoin.run                  avgt        25      108.046        2.687    ms/op
n.s.ForkJoinRecursive.run         avgt        25      101.578        1.315    ms/op
n.s.ForkJoinRecursiveDeep.run     avgt        25      105.796        1.214    ms/op

Patch:

diff --git a/src/main/java/net/shipilev/Disruptor.java b/src/main/java/net/shipilev/Disruptor.java
index 2d14a23..6a08cbf 100644
--- a/src/main/java/net/shipilev/Disruptor.java
+++ b/src/main/java/net/shipilev/Disruptor.java
@@ -95,8 +95,9 @@ public class Disruptor {
     public void setup() {
         PiEventFac fac = new PiEventFac();
         executor = Executors.newCachedThreadPool();
-        disruptor = new com.lmax.disruptor.dsl.Disruptor<PiJob>(fac, 16384, executor, ProducerType.SINGLE, new SleepingWaitStrategy());
-        PiEventProcessor procs[] = new PiEventProcessor[Shared.THREADS];
+        int size = Integer.highestOneBit(Shared.SLICES) << 1;
+        disruptor = new com.lmax.disruptor.dsl.Disruptor<PiJob>(fac, size, executor, ProducerType.SINGLE, new SleepingWaitStrategy());
+        PiEventProcessor procs[] = new PiEventProcessor[Shared.THREADS - 1];
         res = new PiResultReclaimer(Shared.SLICES);
 
         for (int i = 0; i < procs.length; i++) {
@@ -126,7 +127,7 @@ public class Disruptor {
             piJob.partitionId = partitionId;
             ringBuffer.publish(seq);
 
-            partitionId = (partitionId == (Shared.THREADS - 1)) ? 0 : partitionId + 1;
+            partitionId = (partitionId == (Shared.THREADS - 2)) ? 0 : partitionId + 1;
         }
 
         res.latch.await();
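The `Integer.highestOneBit(Shared.SLICES) << 1` expression in the patch rounds the slice count up to a power of two, which the Disruptor requires for its ring buffer size. A standalone sketch of that rounding (hypothetical helper name; `Shared.SLICES` is the benchmark's slice count, and unlike the patch this variant leaves exact powers of two unchanged):

```java
public class RingSize {
    // Round a desired capacity up to the next power of two, as
    // Disruptor ring buffer sizes must be powers of two.
    static int nextPowerOfTwo(int capacity) {
        int highest = Integer.highestOneBit(capacity);
        return (highest == capacity) ? capacity : highest << 1;
    }

    public static void main(String[] args) {
        // 1,000,000 slices need a 2^20 = 1,048,576 slot ring buffer.
        System.out.println(nextPowerOfTwo(1_000_000));
    }
}
```

Sizing the ring to hold all slices means the single publisher never stalls on a full buffer, which is the whole point of the cheat above.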

Mike.



On 16 April 2014 11:03, Rüdiger Möller <moru...@gmail.com> wrote:
Sorry, I'd argue the Fork Join benchmark implementation misses the point.

The point of the benchmark is to measure the speed of processing 1 million individual, independent processing jobs. I simulate processing by computing a slice of PI as a place holder for business logic.

However your test basically creates one Job per thread:
"
        for (int i = 0; i < Shared.THREADS; i++) {
            PiForkJoinTask task = new PiForkJoinTask(Shared.SLICES / Shared.THREADS);
            task.fork();
            tasks.add(task);
        }

"

and then computes the slices in a big loop:
"
        protected Double compute() {
            double acc = 0D;
            for (int s = 0; s < slices; s++) {
                acc += Shared.calculatePi(s);
            }
            return acc;
        }
"

so it basically runs 4 jobs, each computing a 25,000,000-iteration chunk of Pi, instead of processing 1 million jobs each computing a 100-iteration slice of Pi.

A correct implementation would have to submit 1 million jobs from the main loop. In an event-sourced system, e.g. when parallelizing decoding, you can't do stuff like that.
If implemented like this, FJ will be at a huge disadvantage just because one has to create a lot of FJ jobs, while the Disruptor just recycles the ring buffer objects.

This kind of micro-parallelization is very common in real applications, where a process receives hundreds of thousands of events or requests per second with very low processing time for each request/event.

 

Rüdiger Möller

unread,
Apr 15, 2014, 9:34:11 PM4/15/14
to
[Upd] You have to compare with ForkJoinRecursiveDeep, as the other tests do not divide down to a single 100-iteration Pi slice, which means scheduling overhead is saved simply by computing larger jobs. Still, it is impressive how low the scheduling overhead actually is. The effects of work stealing are especially observable: with a plain thread-pool implementation you would expect FJ faster than FJRecursive faster than FJRecursiveDeep, but because of work stealing, shorter pieces of work seem to pay off.
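For reference, the deepest decomposition being compared has this shape: a RecursiveTask that divides all the way down to single 100-iteration slices. This is a hedged standalone sketch, with `calculatePi` as a hypothetical stand-in for the benchmark's `Shared.calculatePi`:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class PiDeepSplit {
    // One 100-iteration slice of the Leibniz series for Pi.
    static double calculatePi(int slice) {
        double acc = 0;
        for (int i = slice * 100; i < (slice + 1) * 100; i++) {
            acc += (4.0 * (1 - (i % 2) * 2)) / (2 * i + 1);
        }
        return acc;
    }

    // Divide-and-conquer down to single slices, as ForkJoinRecursiveDeep
    // does; forked halves become targets for work stealing.
    static class PiTask extends RecursiveTask<Double> {
        final int from, to; // slice range [from, to)
        PiTask(int from, int to) { this.from = from; this.to = to; }

        @Override
        protected Double compute() {
            if (to - from == 1) {
                return calculatePi(from);
            }
            int mid = (from + to) >>> 1;
            PiTask left = new PiTask(from, mid);
            PiTask right = new PiTask(mid, to);
            left.fork();                         // push left half for stealing
            return right.compute() + left.join(); // work on right half locally
        }
    }

    public static void main(String[] args) {
        double pi = ForkJoinPool.commonPool().invoke(new PiTask(0, 10_000));
        System.out.println(pi); // converges toward Math.PI
    }
}
```

The "compute one half yourself, fork the other" pattern is what keeps the deep split cheap: most forked tasks are popped back by the same worker and never migrate.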

Rüdiger Möller

unread,
Apr 15, 2014, 9:51:15 PM4/15/14
to mechanica...@googlegroups.com
On my i7 laptop (T=3) [did not make the queue size adjustment]:

1.7 
Benchmark             Mode   Samples         Mean   Mean error    Units
n.s.Disruptor.run     avgt        60      355,577       13,418    ms/op

1.8
Benchmark                         Mode   Samples         Mean   Mean error    Units
n.s.Disruptor.run                 avgt        60      518,855       14,611    ms/op
n.s.ForkJoinRecursiveDeep.run     avgt        60      280,507        1,534    ms/op

FJ is an amazing piece of work; I am somewhat baffled .. but how can I leverage this for streaming input?



Michael Barker

unread,
Apr 15, 2014, 9:58:05 PM4/15/14
to mechanica...@googlegroups.com
If you don't increase the size of the queue, then you should lower the number of processor threads to Shared.THREADS - 2, as the producer thread will contend with the processing threads.

Mike.



Aleksey Shipilev

unread,
Apr 16, 2014, 1:24:07 AM4/16/14
to mechanica...@googlegroups.com
On 04/16/2014 03:04 AM, Rüdiger Möller wrote:
> Sorry, I'd argue the Fork Join benchmark implementation misses the point.
>
> The point of the benchmark is to measure the speed of processing 1
> million individual, independent processing jobs. I simulate processing
> by computing a slice of PI as a place holder for business logic.

That is why we have ForkJoinRecursiveDeep.

> This kind of micro-parallelization is very common in real application
> where a process receives hundreds of thousands of events or requests per
> second with very low processing time for each request/event.

The original benchmark states nothing about it. You wanted to compute N
slices of M iterations each; ForkJoinRecursiveDeep does just that. True,
FJP may not be useful when you face a specific use case which pushes you
to submit millions of tasks into the pool, but there are obviously
multiple ways you can tackle the problem. Solving the particular problem
(computing Pi in slices) in the F/J-natural way brings lots of
performance, and therefore this particular experiment serves as a
perfect counter-example to this:

On 04/15/2014 10:32 PM, Rüdiger Möller wrote:
> If I want to multithread for performance, I am better off using a
> pipelining scheme, as the intrinsic overhead of lambda/FJ is high
> compared to a pipelining (or well implemented actor) approach.

...is generally wrong.

-Aleksey.

Aleksey Shipilev

unread,
Apr 16, 2014, 2:12:36 AM4/16/14
to mechanica...@googlegroups.com
On 04/16/2014 03:04 AM, Rüdiger Möller wrote:
> A correct implementation would have to submit 1 million jobs from the
> main loop. In a event sourced system e.g. when parallelizing decoding
> you can't do stuff like that.
> If implemented like this, FJ will be at a huge disadvantage just because
> one has to create a lot of FJ jobs, while disruptor just recycles the
> ringbuffer objects.

Again, worrying about the "correct" implementation means targeting that
particular narrow use case... I get what you are saying, but reusing
objects is a performance cheat :) Anyway, here's FJP submitting all the
slices as individual jobs, and also cheating at reuse:
https://github.com/shipilev/disrupting-fjp/blob/master/src/main/java/net/shipilev/ForkJoinReuse.java

Benchmark                         Mean   Mean error    Units
n.s.Disruptor.run              790.382       16.472    ms/op
n.s.ForkJoin.run               659.493       55.216    ms/op
n.s.ForkJoinRecursive.run      572.437       16.728    ms/op
n.s.ForkJoinRecursiveDeep.run  600.702        7.245    ms/op
n.s.ForkJoinReuse.run          772.754        4.545    ms/op

...still not slower.

(Although I don't know the most optimal config for the Disruptor --
pull requests appreciated! Still, it's funny that it has to try to
catch up with out-of-the-box FJP...)

-Aleksey.

Rüdiger Möller

unread,
Apr 16, 2014, 5:20:50 AM4/16/14
to mechanica...@googlegroups.com
see inline comments


Am Mittwoch, 16. April 2014 07:24:07 UTC+2 schrieb Aleksey Shipilev:
On 04/16/2014 03:04 AM, Rüdiger Möller wrote:
> Sorry, I'd argue the Fork Join benchmark implementation misses the point.
>
> The point of the benchmark is to measure the speed of processing 1
> million individual, independent processing jobs. I simulate processing
> by computing a slice of PI as a place holder for business logic.

That is why we have ForkJoinRecursiveDeep.


Yes, I initially missed it (I corrected my post later, but it seems you saw the initial version :-) )
 
> This kind of micro-parallelization is very common in real application
> where a process receives hundreds of thousands of events or requests per
> second with very low processing time for each request/event.

The original benchmark states nothing about it. You wanted to compute N
slices of M iterations each, ForkJoinRecursiveDeep does just that. True
that FJP may not be useful when you face a specific use case which
pushes you to submit millions of tasks in the pool, but there are
obvisouly multiple ways you can tackle the problem. Solving the
particular problem (computing Pi in slices) in F/J-natural way brings
lots of performance, and therefore, this particular experimenta serves
as a perfect counter-example that this:


Agreed, I did not state it explicitly (for me this is absolutely natural in the area I work in). However, it's implicitly contained in the way the benchmarks are constructed (the multithreading bench also pushes 1 million individual jobs, same for the Akka one). My bad for not stating this. However, I'd like to point out that this is not a "narrow" use case but a standard one, as any request/event-processing server faces the parallelizing problem this way.
 
On 04/15/2014 10:32 PM, Rüdiger Möller wrote:
> If I want to multithread for performance, I am better off using a
> pipelining scheme, as the intrinsic overhead of lambda/FJ is high
> compared to a pipelining (or well implemented actor) approach.

...is generally wrong.

It's true for the common case of having to process a series of tiny independent computing jobs, a.k.a. ordinary requests/events. Since business logic is trivial in most cases, the scheduling of jobs (= events, requests) becomes the dominating factor, and this is the issue pipelining/the Disruptor addresses.
Agreed that FJ does an amazing job of parallelizing big computational tasks.

-ruediger


Rüdiger Möller

unread,
Apr 16, 2014, 5:37:32 AM4/16/14
to mechanica...@googlegroups.com
Cool. I'll test this on a 12-core machine (after applying some pedantry to the new impl, ofc :-) ). It will take some time (nasty real-world work to do ..)


> about "correct" implementation is meaning the particular 
> narrow use case... I get what you are saying

As said before, in my world the "narrow case" is the default (servers, event-sourced systems).
Any possibility to backport the FJ bench to 1.7? I noticed a significant performance degradation of the Disruptor with 1.8.

Cheers,
Rüdiger


Aleksey Shipilev

unread,
Apr 16, 2014, 5:50:05 AM4/16/14
to mechanica...@googlegroups.com
On 04/16/2014 01:37 PM, Rüdiger Möller wrote:
> Cool. I'll test this on a 12 core machine (after applying some pedantism
> to the new impl ofc :-) ). Will take some time (nasty real world work to
> do ..)

I think point-wise measures are not enough, btw. We need to measure this
with different numbers of threads in the pool, and different numbers of
threads submitting the work. The JMH API can simplify coding up a
scenario like that.

>> about "correct" implementation is meaning the particular
>> narrow use case... I get what you are saying
>
> as said before, in my world the "narrow case" is the default (servers,
> event sourced systems).

Yeah, I did not mean to say "narrow", sorry (if you're going to write
mail just after waking up, you're gonna have a bad time). But I did want
to say that in the scenarios FJP was designed for (that is, massive
*data* parallelism, which fits the use case of JDK 8 Streams), it works
remarkably well.

> Any possibility to backport the FJ bench to 1.7 ? I noticed a
> significant performance degradation of disruptor with 1.8.

It is possible, but the caveat is that the JDK 7 FJP misses a lot of
very profitable optimizations done for the Streams work. Pick your
poison, I guess. That said, I think something like this will be runnable
on JDK 7:

diff --git a/src/main/java/net/shipilev/ForkJoinReuse.java
b/src/main/java/net/shipilev/ForkJoinReuse.java
index 63f39f4..456d3a8 100644
--- a/src/main/java/net/shipilev/ForkJoinReuse.java
+++ b/src/main/java/net/shipilev/ForkJoinReuse.java
@@ -7,11 +7,13 @@ import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.ArrayList;
import java.util.List;
+import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.TimeUnit;

@@ -23,6 +25,8 @@ import java.util.concurrent.TimeUnit;
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ForkJoinReuse {

+ private ForkJoinPool fjp;
+
/*
The fork-join task below is used as "just" the Callable,
and "reuses" the submitted tasks.
@@ -37,6 +41,11 @@ public class ForkJoinReuse {
}
}

+ @Setup
+ public void setup() {
+ fjp = new ForkJoinPool();
+ }
+
@GenerateMicroBenchmark
public double run() throws InterruptedException {
List<PiForkJoinTask> tasks = new ArrayList<>();
@@ -49,7 +58,7 @@ public class ForkJoinReuse {
}

for (PiForkJoinTask task : tasks) {
- task.fork();
+ fjp.submit(task);
}

double acc = 0D;
@@ -59,7 +68,7 @@ public class ForkJoinReuse {
task.slice = s;
acc += task.join();
task.reinitialize();
- task.fork();
+ fjp.submit(task);
}
s += tasks.size();
}

-Aleksey.

Rüdiger Möller

unread,
Apr 16, 2014, 6:36:25 AM4/16/14
to mechanica...@googlegroups.com


I think point-wise measures are not enough, btw. We need to measure this
with different number of threads in the pool, and different number of
threads submitting the work. JMH API can simplify coding up the scenario
like that.

I plan to simply increase the number of cores and measure scaling, as in the original blog post (the machine is identical). Multiple 'feeding' threads are atypical; one usually avoids MPMC, as a single thread is faster at feeding events into the processing unit than multiple ones. Still, it's another interesting test case ..
 

Yeah, I did not mean to say "narrow", sorry (if you are going to write
mail just after waking up, you gonna have a bad time).

Actually I am a dead man walking, because I fiddled around till 4am and have to work today ..
 
But, I did want
to say that in scenarios which FJP was designed for (that is, massive
*data* parallelism, which fits the use case of JDK 8 Streams), it works
remarkably well.

You know what ? I agree :-)
 

> Any possibility to backport the FJ bench to 1.7 ? I noticed a
> significant performance degradation of disruptor with 1.8.

It is possible, but the caveat would be FJP missing lots of very
profitable optimizations done for Streams work. Pick your poison I
guess. That said, I think something like this will be runnable on JDK 7:


thanks :-)

-ruediger 

Aleksey Shipilev

unread,
Apr 16, 2014, 7:11:13 AM4/16/14
to mechanica...@googlegroups.com
On 04/16/2014 02:36 PM, Rüdiger Möller wrote:
> I think point-wise measures are not enough, btw. We need to measure
> this
> with different number of threads in the pool, and different number of
> threads submitting the work. JMH API can simplify coding up the
> scenario
> like that.
>
>
> i plan to simply increase the number of cores and measure scaling like
> in the original blogpost (machine is identical). Multithreaded 'feeding'
> threads are atypical, one usually avoids MPMC as a single thread is
> faster in feeding events into the processing unit than multiple ones.
> However still another interesting testcase ..

Not only do I think a single client *is* a special case (most of the
businesses I know run for multiple clients; HFT is different in this
regard), but benchmarking-wise it could introduce the unwarranted effect
of clients being unable to saturate the executors. See below.

I pushed a few tiny changes to the workspace, including one which does a
JDK 8 Stream-ish Pi calculation (still arguably fair, since it will
decompose sensibly, not down to the last slice). Here's the result on my
laptop:

Benchmark Mean Mean error Units
n.s.Disruptor.run 831.288 55.772 ms/op
n.s.ForkJoin.run 652.163 41.174 ms/op
n.s.ForkJoinRecursive.run 573.230 8.979 ms/op
n.s.ForkJoinRecursiveDeep.run 607.400 6.469 ms/op
n.s.ForkJoinReuse.run 833.257 22.642 ms/op
n.s.Streams.run 578.153 7.950 ms/op

...and here is the result from 2x12x2 Ivy Bridge, Linux x86_64, JDK 8 GA:

Benchmark Mean Mean error Units
n.s.Disruptor.run 389.008 76.407 ms/op
n.s.ForkJoin.run 21.705 0.187 ms/op
n.s.ForkJoinRecursive.run 23.254 0.156 ms/op
n.s.ForkJoinRecursiveDeep.run 22.557 0.137 ms/op
n.s.ForkJoinReuse.run 1424.009 720.859 ms/op
n.s.Streams.run 23.193 0.143 ms/op

So, even when we decompose a lot, FJP is faster. The ForkJoinReuse test
is interesting, because I *think* it is evidence of load starvation.
Running with 12 client threads (can I actually run the Disruptor with
multiple clients in this case?) yields:

Benchmark Mean Mean error Units
n.s.Disruptor.run 1761.069 122.544 ms/op
n.s.ForkJoin.run 243.990 8.990 ms/op
n.s.ForkJoinRecursive.run 236.459 3.095 ms/op
n.s.ForkJoinRecursiveDeep.run 268.558 8.486 ms/op
n.s.ForkJoinReuse.run 929.113 54.684 ms/op
n.s.Streams.run 242.315 5.961 ms/op

(Note this is an average time across the client threads, so perfect
scalability means average time not changing). Here, under load, we see
the executors behave "a little" differently.

-Aleksey.

Rüdiger Möller

Apr 16, 2014, 1:56:42 PM4/16/14
to mechanica...@googlegroups.com
see inline


On Wednesday, 16 April 2014 at 13:11:13 UTC+2, Aleksey Shipilev wrote:
Not only I think single client *is* a special case (most of the business
I know are running for multiple clients; HFT is different in this
regard), benchmarking-wise it could introduce the unwarranted effects of
clients unable to saturate the executors. See below.

I am not on the HFT side; we build exchange systems, so we have to deal with input from HFT ..

(a) Our systems are clustered; they communicate using brokerless reliable multicast UDP. So a node is basically a cluster client and server, listening and writing messages on/to different topics.
 Typically, all messages of a topic are received in a single thread and often have to be processed in sequence, which adds an additional constraint on parallel processing.

Any event-sourced distributed system will work along these lines, so while it's not too common in general, it's a typical pattern for event-sourced systems.
Real-world clients of the clustered system are handled by dedicated gateway servers, which might also apply similar patterns (e.g. a single thread grabs incoming requests (NIO) and adds them to the processing pipeline). Anyway, we use plain thread pools to serve low-frequency clients; HFT clients have dedicated gateways working differently.

(b) Of course it's more typical to have app server/database (monolithic or farmed) out there; however, these are probably not the kind of applications which are CPU/latency challenged, as they tend to wait for IO most of the time.

If you consider the system architecture outlined in (a), you might understand that people from this background have a hard time figuring out how to make use of FJ, and that e.g. pipelining/actors is a natural fit for these systems (especially processing in parallel while keeping sequence).
For (b), FJ will be much easier to apply, with probably great throughput improvements; however, I'd imagine it still could be problematic, as the application has no control over which thread is selected for actual execution of a particular job/request (deadlocks, anyone?). Not sure about that; it needs real-world experience.


I pushed a few tiny changes to worspace, including the one which does
JDK 8 Stream-ish Pi calculation (still arguably fair since it will
decompose sensibly, not until the last slice). Here's the result on my
laptop:


Haven't had time to star the stuff you shipilev'ed :-). For sure it will help to get a better grasp of FJ. I somehow have to run things on my own in order to get a feel; that's why I want to do an iteration of my old blog bench with an FJ impl taken from you.

-Rüdiger

Rüdiger Möller

Apr 16, 2014, 5:26:12 PM4/16/14

Hi, 

I finally did the original benchmark with an adaptation of your ForkJoinRecursiveDeep on my Opteron 2x8c8t. The results are amazing: FJ scales perfectly even on this machine (notorious for descaling whenever cross-socket traffic comes into play)! The Disruptor's descaling does not occur on Intel; I'll do those runs tomorrow.

The standalone 1.7 Test (copy pasteable) is here: 

https://github.com/RuedigerMoeller/abstractor/blob/master/src/benchmark/java/de/ruedigermoeller/abstraktor/sample/ForkJoinRecursiveDeep.java

I had to adapt your implementation to actually compute Pi, which required adding a second int to FJTask. I ran it using 1.7.0_45 (to keep everything the same as in the original benchmarks).
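For readers without the repo at hand, here is a minimal, JDK 7-compatible sketch of this kind of recursive decomposition. It is illustrative only, not the benchmark code from the link: class names and the threshold are made up, and the "second int" shows up here as each task carrying its own slice bounds.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: a RecursiveTask computing a slice of the Leibniz
// series for Pi, forking until slices fall below a threshold.
public class PiForkJoin {
    static final int THRESHOLD = 10_000; // illustrative cut-off

    static class PiTask extends RecursiveTask<Double> {
        final int from, to; // slice bounds (the "second int" per task)

        PiTask(int from, int to) { this.from = from; this.to = to; }

        @Override
        protected Double compute() {
            if (to - from <= THRESHOLD) {
                // Leibniz series: pi/4 = sum_{k>=0} (-1)^k / (2k + 1)
                double acc = 0.0;
                for (int k = from; k < to; k++) {
                    acc += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);
                }
                return acc;
            }
            int mid = (from + to) >>> 1;
            PiTask left = new PiTask(from, mid);
            left.fork();                                  // schedule left half
            double right = new PiTask(mid, to).compute(); // right half in this thread
            return right + left.join();                   // join the fork
        }
    }

    public static double pi(int slices) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return 4.0 * pool.invoke(new PiTask(0, slices));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(pi(1_000_000)); // converges towards Math.PI
    }
}
```

The fork-one-half, compute-the-other-half-inline shape is the usual idiom: it keeps the submitting thread busy instead of parking it on a join.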











Aleksey Shipilev

Apr 16, 2014, 5:21:39 PM4/16/14
to mechanica...@googlegroups.com
On 04/17/2014 01:15 AM, Rüdiger Möller wrote:
> <https://lh5.googleusercontent.com/-6QzYX6x9LOg/U07y1Le-JDI/AAAAAAAAAOI/VVNZUkUODRQ/s1600/fjointable.png>
>
> <https://lh3.googleusercontent.com/-f1rS4z9ftec/U07yxFAEcSI/AAAAAAAAAOA/io7KY_u3Z5U/s1600/ForkJoin.png>

(checks the thread subject)
So... This is still a thread on "Fork-Join *Calamity*", right?

* Error bounds would be nice to have on that graph.
* Disruptor results seem very worrying: what happens past the 9 cores?

-Aleksey.

Rüdiger Möller

Apr 16, 2014, 5:39:33 PM4/16/14
to mechanica...@googlegroups.com
googlegroups interface sucks hard, had to move to windoze in order to be able to comment on the post .. JS is the future of programming btw.

On Wednesday, 16 April 2014 at 23:21:39 UTC+2, Aleksey Shipilev wrote:
On 04/17/2014 01:15 AM, Rüdiger Möller wrote:
> <https://lh5.googleusercontent.com/-6QzYX6x9LOg/U07y1Le-JDI/AAAAAAAAAOI/VVNZUkUODRQ/s1600/fjointable.png>
>
> <https://lh3.googleusercontent.com/-f1rS4z9ftec/U07yxFAEcSI/AAAAAAAAAOA/io7KY_u3Z5U/s1600/ForkJoin.png>

(checks the thread subject)
So... This is still a thread on "Fork-Join *Calamity*", right?

erm .. but .. but .. 
 
* Error bounds would be nice to have on that graph.

Will you stop promoting your JMH shamelessly! I just wanted a cut-and-paste bench; I wanted to stick to the original blog benches, which were also hand-rolled. Anyway, error bars probably would not make a difference for this test.
 
* Disruptor results seem very worrying: what happens past the 9 cores?


Don't know. Mike Barker also had no clue. It's an Opteron artefact; it does not happen on Intel machines.
 
-Aleksey.

Martin Thompson

Apr 17, 2014, 3:50:30 AM4/17/14
to mechanica...@googlegroups.com
On 16 April 2014 22:39, Rüdiger Möller <moru...@gmail.com> wrote:
googlegroups interface sucks hard, had to move to windoze in order to be able to comment the post .. JS is the future of programming btw.

 
* Disruptor results seem very worrying: what happens past the 9 cores?


Don't know. Mike Barker also had no clue. Its an Opteron artefact, does not happen on intel machines.

Sorry for not being able to add value to this. Client work is getting in the way this week. :-)

I think the reason the Disruptor behaves this way on AMD is that a floating-point unit is shared between pairs of cores on Opteron, whereas Intel cores have their own FP execution units. I suspect, but unfortunately don't have time to verify, that the FJP does a much better job of coping when insufficient CPU resource exists than the Disruptor does. The Disruptor is not a good general-purpose framework to be used when more threads want to run than cores are available.

Try a calculation that is integer based for a comparison.
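To make that suggestion concrete, an integer-only kernel could be swapped in for the floating-point Pi slice. The sketch below is hypothetical (not taken from any of the benchmarks discussed); it uses the common 13/17/5 xorshift shifts purely to keep the integer ALU busy:

```java
// Illustrative integer-only workload: a xorshift-style mix that exercises
// the integer ALU instead of the shared FP units on Opteron.
public final class IntKernel {
    public static int compute(int seed, int iters) {
        int x = seed;
        for (int i = 0; i < iters; i++) {
            x ^= x << 13;
            x ^= x >>> 17;
            x ^= x << 5;
        }
        return x;
    }

    public static void main(String[] args) {
        // Deterministic result, so runs are comparable across machines.
        System.out.println(compute(1, 1_000_000));
    }
}
```

If the Disruptor's descaling past 9 cores disappears with this kernel, the shared-FPU hypothesis gains weight.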

Work stealing pools can be a good solution for throughput focused use cases. The Disruptor is designed for response time focused use cases.

Martin...

Rüdiger Möller

Apr 17, 2014, 5:07:48 AM4/17/14
to mechanica...@googlegroups.com
see comments inline ..


On Thursday, 17 April 2014 at 09:50:30 UTC+2, Martin Thompson wrote:

I think the reason the Disruptor behaves this way on AMD for this is due to a floating point unit being shared between pairs of cores on Opteron. Intel cores have their own FP execution unit. I suspect, but unfortunately don't have time to verify, that the FJP does a much better job of coping when insufficient CPU resource exists compared to the Disruptor. The Disruptor is not a good general purpose framework to be used when more threads want to run than cores are available.


The machine has 32 'cores' and 16 FP units (dual-socket Opteron). So if Linux does even slightly reasonable scheduling, this should not happen. If Linux schedules the (8+x)th thread to a 'virtual' core, then this could be the reason; I would have to pin threads in order to avoid it .. but well, it's an experiment, not a 'benchmark'.
 

Work stealing pools can be a good solution for throughput focused use cases. The Disruptor is designed for response time focused use cases.


It's also easier (at least in my current state of mind) to express task interdependencies with the Disruptor. Originally the test was meant to simulate exactly this: the Pi parts are computed in parallel, then a 'ResultReclaimer' picks up the results in the order they were submitted to the ring buffer. By reducing the amount of computational work of each 'Pi-computation event', one could establish the per-event overhead of Disruptor vs. actor vs. X. 'Computing Pi' is an unfortunate example, as one would associate "divide and conquer a big computing task" with it ..
Wouldn't it make sense to use the ring-buffer part of the Disruptor but schedule work/event handlers using F/J, to profit from the work-stealing implementation?

Rüdiger Möller

Apr 17, 2014, 5:45:28 AM4/17/14
On Intel the numbers are somewhat different (JDK 1.7, dual-socket 6c/6t @ 3 GHz, HT turned off). The numbers at 10 and 11 cores are skewed, as the Disruptor actually uses additional threads to feed jobs and reclaim results. Both variants scale well.

F/J
1: 728
2: 364
3: 245
4: 190
5: 154
6: 127
7: 110
8: 98
9: 86
10: 78
11: 70
12: 67

Disruptor
1: 664
2: 332
3: 222
4: 167
5: 135
6: 111
7: 96
8: 84
9: 84
10: 74
11: 108
12: 113




Rüdiger Möller

Apr 17, 2014, 2:13:33 PM4/17/14
New chart with sequential processing:

I added results of sequential fork/join processing. (I know I went somewhat off-topic, as Martin's initial issue had a much broader scope, but I think it's important to show that fork/join is not well suited as a general solution to parallel execution.)

Conclusion:
  • Fork/Join is superior at splitting a large computing task across multiple processors.
  • It fails hard on patterns where high rates of individual small events or requests need to be processed partially (or completely) in parallel. Akka's bad scaling in this test is probably caused by its use of FJ under the hood.
FJSeq is a slightly modified copy of Aleksey Shipilev's "Sequential with reuse" implementation. In FJSeq_1 I tried to speed this up; however, this yielded better single-threaded performance but worse scaling.

Perhaps 90% of real-world (server) applications process independent, short-running events. Splitting up big computational tasks is somewhat rare.

Therefore (besides the more fundamental considerations sketched in the OP) I think adding a pipelining-based concurrency component to the JDK would be reasonable (e.g. something like the Disruptor).


source:

ymo

Apr 17, 2014, 5:08:13 PM4/17/14
to mechanica...@googlegroups.com
Rüdiger, it seems like the FJ Rec graph is the one with the lowest values (red-orange) all the way to 16 cores. Did I miss something? What's the catch?

Edward Harned

Apr 17, 2014, 6:06:59 PM4/17/14
to mechanica...@googlegroups.com
If we're done with benchmarks, let me respond to the original subject: an evidence-based debate on the design of the F/J framework.

Fork/Join means to split the work into fragments and join the results. Specifically, fork repeatedly and when done, join the results. This framework doesn’t do that. It uses a recursive technique that is fork-and-join. The join does not, cannot, and will never work outside of a closed environment. The professor tried and tried again to make it work in the open environment of Java but it is fatally flawed; it can never work. Therefore, the framework’s basic design is a failure.

Once faced with a design failure, a professional application developer would go back to the drawing board. The professor did not do that. He butchered the bad design to replace the join failure with the CountedCompleter class.

The CountedCompleter class is an amateur attempt at scatter-gather, since it lacks the structure to support scatter-gather. To an already complex computer program he added even more complexity and bad programming. (Search for "CountedCompleter" in the F/J classes. The classes contain specific code just to support the new class.) I don't know about Oracle's programming standards, but in the professional application development community such practice would result in termination. As the article points out, CountedCompleter is just whack-a-mole.

Deques/Submission-Queues are a feature for applications that run on clusters of computers (Cilk for one.) The network communication cost for processors to load balance queues is very high so using a deque for each processor and a submission queue is necessary.

Java runs on shared memory computers. There is no network communication cost between threads to load balance FIFO queues. It is simple to schedule a Task (initial or forked) into the queue with the lowest pending workload. Deques/Submission-Queues are pointless. Supporting a Deques/Submission-Queue design, with its horrendous complexity, just to adhere to the original academic research paper is amateur.

There are other ways of doing F/J. I gave you a reference to an application server that uses a simple scatter-gather.

The basic design for this framework is a failure, as is the Deques/Submission-Queues feature. Brilliant minds continue to play whack-a-mole to make it function. What those people ignore is that whack-a-mole always, always, always fails. People use software for other than its intended uses, especially outside of a research arena. There will come a day when whack-a-mole no longer works. When the complexity overwhelms even the brightest minds, the only choice will be to replace this failure with a good, clean, simple design. You know that old saying: "You can pay now or you can pay more later."

I think the biggest problem is that people buy a product because of its packaging. How many developers have thoroughly looked at the F/J framework structure before declaring it another fine product of the professor?

ed

Aleksey Shipilev

Apr 18, 2014, 5:19:33 AM4/18/14
to mechanica...@googlegroups.com
On 04/17/2014 10:05 PM, Rüdiger Möller wrote:
> New chart with sequential processing:
>
> I added results of sequential fork/join processing (I know I went
> somewhat off-topic as Martins initial issue had a much broader
> scope, but anyway I think its important to show, that *fork join is
> not well suited* as a general solution to parallel execution).

I think people would be much better off not saying "general" in 99.99%
of the cases. There's nothing "general" at all.

> Conclusion:

Before you jump to conclusions, make sure you did three things:
a) Do proper benchmarking: I already did the JMH stub for FJP and
Disruptor -- not only because it does introduce run-to-run variance
estimates, computes error bounds, et cetera, but also because it is more
easily reviewable.
b) From the usages you are having for FJP and the previous requests, I
infer you are running JDK 7? Although I specifically called this
experiment potentially misleading, because it misses the major FJP
optimizations in JDK 8.
c) Understanding the performance model includes the research how all
implementations react to changing the duration of those "events", in
order to quantify what does each implementation consider to be "unfitting".

> Like 90% of real world (server) applications are processing
> independent, short running events. Splitting up big computational
> tasks is somewhat rare.

As someone else said in this thread, mentioning percents of real world
"should be avoided as its strong subjective color unless you have at
less a public poll to support this". JDK 8 Streams is the major evident
counter-example of this.

> Therefore (besides more fundamental considerations scratched in OP) I
> think adding a pipelining based concurrency component to JDK would be
> reasonable (e.g. something like Disruptor).

Cool, advocate the need for that implementation with better due
diligence (see above), push it through the community process (e.g. JEP),
and actually work on coding/adjusting it for JDK. This would be real
community work :)

-Aleksey.

Rüdiger Möller

Apr 18, 2014, 5:25:23 AM4/18/14
to mechanica...@googlegroups.com
Real servers/event-sourced systems face the problem of parallelizing the processing of sequentially in-streaming independent events. F/J excels when I batch 1 million (interdependent) jobs and let F/J schedule them in an optimized way. However, if I feed them in as 1 million independent series of jobs, it performs poorly.

However, in real systems the dominant use case is the latter (e.g. Akka). Another example: F/J is very fast at sorting a 100-million-element array, but slow when I have to sort 10 million arrays of length 10 without being allowed to batch them into large chunks.
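The two workload shapes can be sketched side by side. This is an illustrative sketch only (sizes scaled down, class and method names made up): one large array lets the common pool decompose a single task well, while many tiny independent sorts pay scheduling overhead per task.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;

// Illustrative contrast of workload shapes. One big parallel sort amortizes
// fork/join overhead; many independent 10-element sorts pay per-task
// submission/scheduling cost that dominates the actual work.
public class SortShapes {
    public static void sortOneLarge(int[] big) {
        Arrays.parallelSort(big); // FJP decomposes one big task well
    }

    public static void sortManySmall(final int[][] smalls, ForkJoinPool pool) {
        // Each tiny sort becomes its own unit of work in the pool.
        pool.submit(() ->
            Arrays.stream(smalls).parallel().forEach(Arrays::sort)
        ).join();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] big = rnd.ints(1_000_000).toArray();
        sortOneLarge(big);

        int[][] smalls = new int[100_000][];
        for (int i = 0; i < smalls.length; i++) smalls[i] = rnd.ints(10).toArray();
        sortManySmall(smalls, ForkJoinPool.commonPool());
    }
}
```

Timing the two methods for equal total element counts would quantify the per-task overhead being discussed.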

Martin Thompson

Apr 18, 2014, 6:03:03 AM4/18/14
to mechanica...@googlegroups.com
Edward, I believe a lot of this is conjecture. By no measure can I come to the conclusion that FJ is a failure or flawed in all contexts. There are many examples where divide-and-conquer approaches are a great success. Take the parallel new or parallel old collectors for GC: in the throughput cases they win hands down, and a hat tip has to go to that. They are not ideal when response time is the goal, but that is a different requirement.

Benchmarks show that FJP is an excellent general implementation of the divide-and-conquer approach to a large decomposable task.

My original intent was not to try and find flaws with FJP or parallel streams. It was to put them in context to find the sweet spot of their applicability. I'm aware of many concurrent and parallel techniques, and in my experience of building real-world systems, some work better than others given differing requirements.

In my experience most applications are reactive to events or incoming requests. These are typically web requests from HTTP clients or from messaging-system-based clients. As Gil pointed out in another thread, request-level parallelism often deals well with that, provided we have little data contention. The other mainstream class of application is analysis-based applications that run queries against large datasets that are a combination of in-memory and on-disk.

For systems that respond to requests or events, I personally find it very natural to introduce parallelism by pipelining the problem or via actors. Some stages of a pipeline, or individual actors, need to do long-running tasks that are CPU intensive, or more likely make a blocking external call, for which work pools are a great solution.
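A minimal sketch of such a pipeline might look like the following. It is illustrative only (names and the squared-integer "work" are made up): a bounded queue hands events from a decode stage to a processing stage, so the stages run concurrently while per-stream ordering is preserved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative two-stage pipeline: stage 1 "decodes" events into a bounded
// queue, stage 2 consumes them strictly in order. Parallelism comes from
// the stages running concurrently; ordering within the stream is preserved.
public class Pipeline {
    static final int POISON = -1; // end-of-stream marker

    public static long run(final int n) throws InterruptedException {
        final BlockingQueue<Integer> decoded = new ArrayBlockingQueue<>(1024);
        final long[] result = new long[1];

        Thread decoder = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) decoded.put(i * i); // "decode" step
                decoded.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread processor = new Thread(() -> {
            try {
                long sum = 0;
                int v;
                while ((v = decoded.take()) != POISON) sum += v; // ordered consume
                result[0] = sum;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        decoder.start();
        processor.start();
        decoder.join();
        processor.join();
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10)); // sum of squares 0..9
    }
}
```

The bounded queue also gives natural backpressure: a fast decoder blocks rather than flooding a slow processor.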

The question I'm asking that started this thread is: how do parallel streams fit into these typical scenarios that I see most applications follow? In the coarse-grained request concurrency model offered by app servers, I'm scared of what people will do with parallel streams. In the event processing world I can see benefits to taking a reactive streams approach (http://www.reactive-streams.org/), but this is different in design goals from parallel collections.

For data-analysis-style applications, I see much larger benefits from work on data structures than from trying to gain micro-parallelism via the parallel streams API over sub-optimal data structures that exhibit far too much pointer chasing.

In summary, I'm trying to see the use case for how parallel streams fit into the overall design of the typical types of applications I see every day. I'm not interested in a benchmark pissing contest. Benchmarks are great, but only when comparing like-for-like. Maybe my little brain cannot see where I best use them in typical applications and I'm just too comfortable with my existing toolkit of pipelines, actors, and work pools, each used where appropriate.

Martin...




Rüdiger Möller

Apr 18, 2014, 8:05:52 AM4/18/14
inline ..

 

I think people would be much better off not saying "general" in 99.99%
of the cases. There's nothing "general" at all.


OK, let's say FJ is not a fit for everything (though it has somewhat been promoted as being that). It is a very well engineered solution for parallelizing big computation tasks. It won't help in scaling up servers/event-sourced systems. I mean, it's not a hallucination but observable in many Java server applications out there: business logic is frequently trivial CPU-wise; most time is spent enqueuing, decoding, scheduling. What do we want? We want our apps to scale up if we throw hardware at them. The reality is, in most cases they don't.
 
Before you jump to conclusions, make sure you did three things:
 a) Do proper benchmarking: I already did the JMH stub for FJP and
Disruptor -- not only because it does introduce run-to-run variance
estimates, computes error bounds, et cetera, but also because it is more
easily reviewable.

The test measures the ability to scale up a typical server load by adding cores. The JMH stubs do a run using all available CPUs; I want to run the test with 1..N CPUs. Do I have to write 16 JMH stubs to do that? I just could not quickly figure out how.
It's not about measuring tiny differences in overall performance. Given that the results show huge differences (e.g. a factor of 2-5), I don't see how error bars and run-to-run variance (~5%, as stated in the original blog post) will change the big picture.
 
 b) From the usages you are having for FJP and the previous requests, I
infer you are running JDK 7? Although I specifically called this
experiment potentially misleading, because it misses the major FJP
optimizations in JDK 8.

I measured both (I think I mentioned that somewhere above). This particular test showed no significant performance difference between JDK 1.7 and 1.8 (at least on my 4-core box; I did not check on the multi-socket machines). But the Disruptor degraded by something like 15% on 1.8. Given that a big part of the industry needs to stick with 1.7 for at least another 2-5 years, I prefer testing 1.7 (especially as the results do not differ by an order of magnitude). Also, this way I don't have to redo all the tests done before. But anyway, I'll check whether 1.8 makes a difference at a higher number of cores.
 

 c) Understanding the performance model includes the research how all
implementations react to changing the duration of those "events", in
order to quantify what does each implementation consider to be "unfitting".


I don't understand this. There are 3 variables: the number of Pi jobs, the number of iterations per individual job, and the number of threads/cores involved. You can do 100k jobs (~events/requests) each computing a 1000-iteration slice, which will show similar performance amongst all variants tested. The shorter the processing time of an individual job, the more the differences between the scheduling/queueing schemes show up.
 
> Like 90% of real world (server) applications are processing
> independent, short running events. Splitting up big computational
> tasks is somewhat rare.

As someone else said in this thread, mentioning percents of real world
"should be avoided as its strong subjective color unless you have at
less a public poll to support this". JDK 8 Streams is the major evident
counter-example of this.


OK, maybe it's dangerous to come up with a "90%" number. But we can make an educated guess that it's a very popular pattern of application load; or would anyone disagree?
 
> Therefore (besides more fundamental considerations scratched in OP) I
> think adding a pipelining based concurrency component to JDK would be
> reasonable (e.g. something like Disruptor).

Cool, advocate the need for that implementation with better due
diligence (see above), push it through the community process (e.g. JEP),
and actually work on coding/adjusting it for JDK. This would be real
community work :)


I don't get the idea of doing work for free, sorry :-). I share observations/code that I have done 'anyway', but this is different from doing a large amount of work for free unrelated to a current (paid) project. However, I'd be willing to participate in such a JEP.
Regarding 'due diligence': if you think there is something fundamentally flawed in the "sequential FJ" example, show me. As I don't consider myself a fail-safe Java rockstar, maybe there is a flaw or an overlooked fundamental systematic issue; you can cut, paste and run those 50 lines of code. You need at least 8 (real) cores to get into the interesting zone.
But I disagree that "missing error bars" invalidates the result of a test showing such huge differences.

On a side note (also to Mr Harned): we are all here to gain insight; there is no point in devaluing each other's work or insulting other people.

Martin Thompson

Apr 18, 2014, 7:56:05 AM4/18/14
to mechanica...@googlegroups.com
On 18 April 2014 12:24, Rüdiger Möller <moru...@gmail.com> wrote:

On a side note (also to Mr Harned): we are all here to gain insight; there is no point in devaluing each other's work or insulting other people.

+1 

Hear, hear!

Rüdiger Möller

Apr 18, 2014, 8:00:45 AM4/18/14
to mechanica...@googlegroups.com

On Friday, 18 April 2014 at 12:03:03 UTC+2, Martin Thompson wrote:
I'm not interested in a benchmark pissing contest. Benchmarks are great but only when comparing like-for-like. Maybe my little brain cannot see where I best use them in typical applications and I'm just too comfortable with existing tool kit of pipelines, actors, and work pools; each used where appropriate.


Disagree; call it an experiment, not a "benchmark": comparing different approaches to solving a given application load pattern. As one can see in the case of Akka, choosing an improper one does hurt.
 

Martin Thompson

Apr 18, 2014, 8:22:44 AM4/18/14
to mechanica...@googlegroups.com
Fair comment. Benchmarks could be considered a subset of a wider experimental approach. Experiments and scientifically directed discovery are good and should be encouraged. We just need to take care not to read too much into results when we are not comparing like-for-like.


--

Edward Harned

Apr 18, 2014, 2:48:34 PM4/18/14
to mechanica...@googlegroups.com


Rüdiger Möller: I know the protocol is to go after the object, not the man. However, in any investigation there comes a time when one needs to look at the perpetrators. These articles have been around for four years now, scrutinizing the object in detail, and today most people don't even want to look at the crime scene, much less the participants.

The professor made an honest mistake in repeatedly trying to port Cilk to Java but Oracle has gravely exacerbated that mistake by using the F/J framework as their parallel engine in Java8. Perhaps they were looking for a cheap, readily available multitasking program rather than building one themselves. In any case, the framework has serious flaws that have the potential to cause harm in the coming years.

ed

Robert Engels

Apr 18, 2014, 4:14:45 PM4/18/14
to mechanica...@googlegroups.com
I promise this is not going to degenerate like my previous post.... :)

Interesting thread. It seems to me that a common concern here is sizing the thread pools properly, and a worry about thrashing from over-allocating threads.

In our framework we utilize native real-time thread priorities on a per-queue basis (and each queue is partitioned), since for us some work is "more important" than other work. By using native thread priorities you can insulate yourself from many of these problems: each work queue can have as many threads as needed (based on IO blocking needs and/or number of cores), and the OS prevents thrashing by ensuring the high-priority work doesn't get affected by the low-priority work. When running on a kernel with real-time support (scheduling and preemption), I think you see surprisingly effective performance, especially in terms of critical latency.
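A rough sketch of the per-queue idea in plain Java (names illustrative, not the framework described above): one executor per work queue, each with its own thread priority. Note that `Thread.setPriority` is only a hint on an ordinary JVM; the native real-time scheduling classes mentioned need OS-level support.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

// Illustrative sketch: one fixed pool per work queue, each pool's threads
// carrying a different priority. On a non-real-time kernel this is only a
// scheduling hint (true priorities need e.g. SCHED_FIFO support).
public class PriorityPools {
    static ExecutorService pool(final String name, final int priority, int threads) {
        ThreadFactory tf = r -> {
            Thread t = new Thread(r, name);
            t.setPriority(priority); // hint only on non-real-time kernels
            t.setDaemon(true);
            return t;
        };
        return Executors.newFixedThreadPool(threads, tf);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService critical = pool("critical", Thread.MAX_PRIORITY, 2);
        ExecutorService bulk = pool("bulk", Thread.MIN_PRIORITY, 2);

        Future<Integer> important = critical.submit(() -> 42); // high-priority work
        System.out.println(important.get());

        critical.shutdown();
        bulk.shutdown();
    }
}
```

The design choice here is isolation: over-allocating threads in the "bulk" pool cannot starve the "critical" pool, because the scheduler favors the higher-priority threads.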

Rüdiger Möller

Apr 18, 2014, 9:46:43 PM4/18/14
On Friday, 18 April 2014 at 00:06:59 UTC+2, Edward Harned wrote:
If we're done with benchmarks. Let me respond to the original subject: evidence based debate on the design of the F/J framework.

Fork/Join means to split the work into fragments and join the results. Specifically, fork repeatedly and when done, join the results. This framework doesn’t do that. It uses a recursive technique that is fork-and-join. The join does not, cannot, and will never work outside of a closed environment. The professor tried and tried again to make it work in the open environment of Java but it is fatally flawed; it can never work. Therefore, the framework’s basic design is a failure.


I don't understand your complaint in depth; can you add more detail? What do you mean by "closed environment"? What kind of failure would you predict?

Richard Warburton

Apr 19, 2014, 3:34:59 AM4/19/14
to mechanica...@googlegroups.com
Hi,

However, in real systems the dominant use case is the latter (e.g. Akka). Another example: F/J is very fast at sorting a 100-million-element array, but slow when I have to sort 10 million arrays of length 10 without being allowed to batch them into large chunks.

I've been following this discussion from the sidelines and I think there are a lot of assumptions being fired around about the popularity of use cases. I can think of a fair few examples where people have jobs that are suited to data parallelism, and jobs that aren't, where a task-parallel approach would be better.

Sitting on a mailing list denying that data-parallel problems exist in real systems is the kind of claim for which you would really need to do a proper empirical analysis in order to convince me. If you've done such an analysis then by all means link me to it.

If you haven't done that, then a much more constructive approach would be to take a look at the core Java APIs and identify things that need to be fixed to support pipeline-based task parallelism. To me this is an actual issue, unlike the introduction of support for data-parallel operations on collections.

Here's an example of the kind of thing I mean: JDBC is an inherently synchronous design. Its use doesn't fit at all well into a pipelined/reactive/async world view, although it does work fine in a blocking servlet-container model. I appreciate that a lot of the HFT people on this list are immediately rolling their eyes at the use of a SQL database, but it's something that's commonly used, and if you want to evolve Java it's a viable target for improvement.
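For illustration, the usual bridge today (a sketch only, not any official API — the `Supplier` here stands in for real JDBC work) is to confine the blocking calls to a dedicated pool and hand back a `CompletableFuture`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// JDBC blocks the calling thread, so the common workaround is to fence the
// blocking work off on its own pool and expose a future. The Supplier stands
// in for real JDBC work (Connection/Statement/ResultSet handling).
final class AsyncDb {
    // Size this pool for the database, not the CPU: each in-flight query
    // parks one of these threads until the driver returns.
    private final ExecutorService jdbcPool = Executors.newFixedThreadPool(8);

    <T> CompletableFuture<T> query(Supplier<T> blockingQuery) {
        // supplyAsync moves the blocking call off the caller's thread
        return CompletableFuture.supplyAsync(blockingQuery, jdbcPool);
    }
}
```

Callers then compose with `thenApply`/`thenCompose` instead of blocking — the blocking hasn't gone away, it's just fenced off, which is exactly why a truly asynchronous driver API would be an improvement.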

regards,

  Richard Warburton

Michael Barker

unread,
Apr 19, 2014, 3:45:48 AM4/19/14
to mechanica...@googlegroups.com
 
However, in real systems the dominant use case is the latter (e.g. Akka). Another example: F/J is very fast at sorting an array of 100 million elements, but slow when I have to sort 10 million arrays of length 10 without being allowed to batch them into large chunks.

[citation needed]
 
Sitting on a mailing list denying that data-parallel problems exist in real systems is the kind of claim you would really need a proper empirical analysis to convince me of. If you've done such an analysis then by all means link me to it.

+1 

Rüdiger Möller

unread,
Apr 19, 2014, 6:59:49 AM4/19/14
to mechanica...@googlegroups.com
see inline comments

Am Samstag, 19. April 2014 09:34:59 UTC+2 schrieb Richard Warburton:
I've been following this discussion from the sidelines and I think there are a lot of assumptions being fired around about the popularity of use cases. I can think of a fair few examples where people have jobs suited to data parallelism, and others where data parallelism doesn't fit and a task-parallel approach would be better.

Sitting on a mailing list denying that data-parallel problems exist in real systems is the kind of claim you would really need a proper empirical analysis to convince me of. If you've done such an analysis then by all means link me to it.


1. Unfortunately there is no study and there probably won't be one any time soon, so we have to fall back on reporting what we see in large enterprises and across different consulting projects. See this thread as 'part of a survey' instead of a call for one :-)
2. I did not deny the existence of "data parallel" problems (assuming you mean those for which FJ is well suited), I just think they are not that common.
Given that data parallelism has extensive support in the JDK, wouldn't it be up to you to link a study indicating there is no need for a pipelining/actor framework?
Off the top of my head, from 15+ years of consulting only one or two such use cases come to mind.

Examples: 
- (JS-backed) SPA webapps reduce the amount of work on the server side a lot; frequently a request boils down to an 'if' or a hashmap lookup.
- event-sourced/reactive designs are becoming more popular
- non-IO-bound web services
- use of NoSQL 'databases' and in-memory data grids (powered by async interfaces).

All of these are use cases where the pure scheduling/queuing of requests/events regularly dominates the execution time of the "real logic". If you dig into the frameworks powering those apps you'll find insane attempts at working around the current concurrency style (e.g. serialized continuations to transform massive multithreading into a series of events).

3. Always factor in the language barrier. As a non-native speaker it's always hard to gauge the 'boldness' of a statement, and really hard to pick up the nuances (cultural differences in what counts as 'insulting' add to that). Probably one of the reasons internet discussions so frequently go up in flames :-)
 
If you haven't done that, then a much more constructive approach would be to take a look at the core Java APIs and identify things that need to be fixed to support pipeline-based task parallelism. To me this is an actual issue, unlike the introduction of support for data-parallel operations on collections.


I don't see the need to 'fix' the existing API, but I think having a basic lightweight actor implementation and a basic pipelining framework would make the concurrency APIs more feature-complete and would standardize things somewhat. Those implementations (imo) would have to focus on low per-event/scheduling overhead.
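As a sketch of how small such a building block could be (all names here are made up for illustration, not a proposed JDK API): an actor is just a mailbox plus a guarantee that its behaviour never runs concurrently with itself, so messages are processed one at a time in arrival order.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Minimal actor sketch: a mailbox drained by at most one thread at a time.
// The CAS on `scheduled` ensures the behaviour never runs concurrently with
// itself, which is the actor guarantee; the executor supplies the threads.
final class TinyActor<M> {
    private final Queue<M> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean();
    private final Consumer<M> behaviour;
    private final Executor executor;

    TinyActor(Consumer<M> behaviour, Executor executor) {
        this.behaviour = behaviour;
        this.executor = executor;
    }

    void tell(M message) {
        mailbox.add(message);
        trySchedule();
    }

    private void trySchedule() {
        if (scheduled.compareAndSet(false, true)) {
            executor.execute(this::drain);
        }
    }

    private void drain() {
        M m;
        while ((m = mailbox.poll()) != null) {
            behaviour.accept(m);
        }
        scheduled.set(false);
        if (!mailbox.isEmpty()) trySchedule(); // a message raced in after the drain
    }
}
```

Because many actors can share one executor, the per-actor cost is just a queue and a flag, which is the kind of low per-event overhead I mean.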
 
Here's an example of the kind of thing I mean: JDBC is an inherently synchronous design. Its use doesn't fit at all well into a pipelined/reactive/async world view, although it does work fine in a blocking servlet-container model. I appreciate that a lot of the HFT people on this list are immediately rolling their eyes at the use of a SQL database, but it's something that's commonly used, and if you want to evolve Java it's a viable target for improvement.


*rolleyes* :-). I think for IO-bound applications the current thread-pool-based designs are OK-ish. However, I see a shift in server load patterns driven by JavaScript-backed 'fat' browser clients (which reduce per-request work drastically), stuff like long polling/websockets, and NoSQL/data grids offering async APIs.

Rüdiger Möller

unread,
Apr 19, 2014, 7:05:28 AM4/19/14
to mechanica...@googlegroups.com
Am Samstag, 19. April 2014 09:45:48 UTC+2 schrieb mikeb01:
 
However, in real systems the dominant use case is the latter (e.g. Akka). Another example: F/J is very fast at sorting an array of 100 million elements, but slow when I have to sort 10 million arrays of length 10 without being allowed to batch them into large chunks.

[citation needed]

Scroll upward and replace 'sorting' with 'computing Pi'. If you think the experiment has breaking flaws (besides the missing error bars), you're more than welcome to point them out.
 
 
Sitting on a mailing list denying that data-parallel problems exist in real systems is the kind of claim you would really need a proper empirical analysis to convince me of. If you've done such an analysis then by all means link me to it.

+1 

must have been another thread ;) 

Robert Engels

unread,
Apr 19, 2014, 9:01:17 AM4/19/14
to mechanica...@googlegroups.com
Why would someone use FJ for 10 million 10-entry arrays? There is nothing to fork and join. You would use a simple partitioned work queue.
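A sketch of such a partitioned work queue, built from nothing but standard JDK executors (class and method names are illustrative): route each task by key hash to one of N single-threaded executors, so there is no splitting, no stealing, no joining, and per-key ordering comes for free.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A partitioned work queue from plain JDK classes: N single-threaded
// executors, with tasks routed by key hash. Work for the same key always
// lands on the same thread, preserving order without any coordination.
final class PartitionedQueue {
    private final ExecutorService[] partitions;

    PartitionedQueue(int n) {
        partitions = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            partitions[i] = Executors.newSingleThreadExecutor();
        }
    }

    Future<?> submit(Object key, Runnable task) {
        // floorMod keeps the index non-negative for negative hash codes
        int idx = Math.floorMod(key.hashCode(), partitions.length);
        return partitions[idx].submit(task);
    }
}
```

For the 10-million-tiny-arrays case, each small sort is submitted as one task; the only overhead per task is a single enqueue, with no fork/join bookkeeping at all.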

Rüdiger Möller

unread,
Apr 19, 2014, 10:08:06 AM4/19/14
to

Am Samstag, 19. April 2014 15:01:17 UTC+2 schrieb Robert Engels:
Why would someone use FJ for 10 million 10-entry arrays? There is nothing to fork and join. You would use a simple partitioned work queue.

Exactly, there is nothing to fork and join, as FJ addresses a specific problem domain only.
The array example is fuzzy as it's not a diamond pattern anymore (consider something like having to add up the sizes of all the tiny arrays after sorting). But anyway, the overhead of scheduling/enqueuing using the JDK's thread pools is too high (plus contention when summing up). It does not scale.
The best scaling results here can be achieved with the Disruptor/pipelining; that's what I wanted to show. The JDK is missing a basic concurrency building block which cannot be replaced by pools or even F/J.
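To make the pipelining point concrete, here is a minimal two-stage pipeline over a bounded JDK queue (a sketch only — the Disruptor follows the same shape but replaces the queue with a preallocated ring buffer to cut allocation and contention): stage one sorts each tiny array, stage two accumulates a result, and the aggregation stays contention-free because only one thread sums.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two-stage pipeline: sorter thread -> bounded queue -> summing thread.
// Each stage runs on its own thread; handoff is the only synchronization.
final class TwoStagePipeline {
    // Sorts each array, then sums the smallest element of each (the "join"
    // work happens on a single thread, so there is no contention on the sum).
    static long sumOfSortedMins(int[][] jobs) throws InterruptedException {
        BlockingQueue<int[]> handoff = new ArrayBlockingQueue<>(1024);
        long[] total = {0};

        Thread sorter = new Thread(() -> {
            try {
                for (int[] job : jobs) {
                    Arrays.sort(job);        // stage 1: sort each tiny array
                    handoff.put(job);
                }
                handoff.put(new int[0]);     // empty array = end-of-stream marker
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread summer = new Thread(() -> {
            try {
                for (int[] job; (job = handoff.take()).length > 0; ) {
                    total[0] += job[0];      // stage 2: accumulate the minimum
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        sorter.start(); summer.start();
        sorter.join(); summer.join();
        return total[0];
    }
}
```

The per-item cost is one put/take on the handoff queue, versus a task-object allocation plus pool scheduling per item with an executor, which is the overhead being measured in the benchmarks above.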
 

Robert Engels

unread,
Apr 19, 2014, 10:31:39 AM4/19/14
to mechanica...@googlegroups.com
Correct, but it is trivial to create a partitioned pool with the standard JDK classes.

On April 19, 2014 8:37:44 AM CDT, "Rüdiger Möller" <moru...@gmail.com> wrote:

Am Samstag, 19. April 2014 15:01:17 UTC+2 schrieb Robert Engels:
Why would someone use FJ for 10 million 10-entry arrays? There is nothing to fork and join. You would use a simple partitioned work queue.

The array example is fuzzy as it's not a diamond pattern anymore (consider something like having to add up the sizes of all the tiny sorted arrays). But anyway, the overhead of scheduling/enqueuing using the JDK's thread pools is too high. You won't see any scaling effects.
 
