Linus having a rant about parallel computing

Martin Thompson

Jan 1, 2015, 4:41:22 AM
to mechanica...@googlegroups.com
http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html

The re-posted comments in the article are worth a read.

Happy New Year Everyone!

Martin...

Jan van Oort

Jan 1, 2015, 4:57:53 AM
to mechanical-sympathy
Great. 2015 starts exactly the way it should: with a Linus rant, forcing us to think (yet) more deeply. Glad I didn't ingest too much bubbly liquid last night.

Happy New Year ! 



Fortuna audaces adiuvat - hos solos ? 


Kirk Pepperdine

Jan 1, 2015, 6:20:47 AM
to mechanica...@googlegroups.com
If you think about it, the most useful frameworks are pooling and caching techniques, so… I'm not sure he's all that wrong… for the moment. Once we have better tools (languages and methods) to support parallelism, I think this becomes an "everything that's going to be invented has been invented" rant.

Happy New Year!
Kirk

Eric DeFazio

Jan 1, 2015, 10:23:45 AM
to mechanica...@googlegroups.com
 
I'm of the camp that believes the advances in processor performance (e.g. CPU pipelining) over the last decade or so have allowed inefficient code to perform "good enough", so the quality of the average program (and the depth of skill of the average programmer) has gone down. For instance, the argument to "upgrade" to more memory and/or faster hardware is one I've seen win, since custom software is expensive relative to the cost of hardware.

Basically, the software industry has gotten good at marketing and selling (pretty bad) bloated and slow tools/applications/frameworks/libraries/APIs/languages that are "easy to use", plus tools that let novice programmers just configure things, plug things in, and hook things up. In short, it's gotten much easier to whip up a company website or implement some business logic by hooking together many "generic" solutions with relatively junior developers.

On the other hand, the skill required to build and take advantage of the more complicated (memory-efficient/sensitive/critical) applications that need ridiculous data throughput (high-frequency trading, shotgun genome assembly, real-time big-data visualization) has gone up, given the complexities of new computer architectures: we can no longer think about memory and cache the way we used to.

So there is going to be a larger and larger divide in performance and efficiency between the "cream of the crop" applications, where more elite developers cultivate mechanical sympathy and learn to take advantage of everything (cache, cores, GPGPUs) at their disposal, versus the average run-of-the-mill application (where the business difference between a page turn of 1 second and 0.8 seconds is negligible).

In short, I can understand why people say "4 cores is enough, just add more cache, 'cause my programs aren't very parallel", because most programs probably won't be developed to take advantage of 100 cores in the next decade or so. On the other hand, when you do have data-bound problems that require programs to exploit GPGPU or hundreds of cores, these will push the industry in the massively parallel direction. So perhaps we'll see more divergence between "server" hardware and "desktop" (or "tablet") hardware, which seems to be happening with the new trend towards power efficiency.

My 2 cents (happy new year everyone)
  

Kirk Pepperdine

Jan 2, 2015, 1:01:13 AM
to mechanica...@googlegroups.com
IME, more power means being able to tackle problems that were previously simply too large to tackle. Initially there is always a limited number of those too-large problems, but then they come. Distributed computing simply wasn't attempted by most teams until quite recently; now it's second nature, not even news. Before, the attitude was: why would I do that? Look at all the problems I'd take on that I avoid when I can do everything locally in a stove-piped system. And initially all problems looked like stove-pipes, so the people who said that, at that time, weren't wrong, but of course now they are. But we now have better tools (languages and methods), and so...

Regards,
Kirk

Martin Thompson

Jan 2, 2015, 8:08:03 AM
to mechanica...@googlegroups.com
Bigger caches are of limited value once the working set is larger than the cache size[1]. In the low-latency space we often do crazy things to keep the entire application in cache, but this is not the mainstream. Moving to large pages, and having L2 TLB support for those large pages, can be more significant than the actual cache size, as we can now see for some large-memory applications running on Haswell.

Rather than going parallel it is often more productive moving to cache friendly or cache oblivious (actually very cache friendly) data structures. It is very easy to make the argument that if a small proportion of the effort that went into Fork-Join and parallel streams was spent on providing better general purpose data structures, i.e. Maps and Trees, that are cache friendly then mainstream applications would benefit more than they do from FJ and parallel streams that are supposedly targeted at "solving the multi-core problem". It is not that FJ and parallel streams are bad. They are really an impressive engineering effort. However it is all about opportunity cost. When we choose to optimise we should choose what gives the best return for the investment.
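
To make that concrete, the sort of thing I have in mind is a primitive-specialised map laid out over flat arrays instead of a web of Entry objects. The sketch below is only an illustration of the idea (no resize, no removal, zero reserved as the missing key), not any particular library's implementation:

// Rough sketch only: a cache-friendly long -> long map using open addressing
// over flat parallel arrays. Not production code: no resize, no removal,
// capacity must be a power of two, and zero is reserved as the "missing" key.
public final class LongLongMap {
    private static final long MISSING = 0L;

    private final long[] keys;
    private final long[] values;
    private final int mask;

    public LongLongMap(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        values = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public void put(long key, long value) {
        int index = hash(key);
        while (keys[index] != MISSING && keys[index] != key) {
            index = (index + 1) & mask; // linear probe: the next slot, not a random pointer chase
        }
        keys[index] = key;
        values[index] = value;
    }

    public long get(long key, long defaultValue) {
        int index = hash(key);
        while (keys[index] != MISSING) {
            if (keys[index] == key) {
                return values[index];
            }
            index = (index + 1) & mask;
        }
        return defaultValue;
    }

    private int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // simple mixing step; any decent spread function will do
        return (int) (h ^ (h >>> 32)) & mask;
    }
}

Probing walks contiguous memory, so a lookup touches a predictable handful of cache lines rather than chasing a pointer per entry.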

A lot can be achieved with more coarse-grained parallelism/concurrency. The Servlet model is a good example of this, or even how the likes of PHP can scale on the server side. Beyond this, pipelining is often a more intuitive model that is well understood and practised extensively by our hardware friends.
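
As a toy illustration of the pipelining idea: single-threaded stages handing work over bounded queues. The stage names and the EOF sentinel below are invented for the sketch:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy pipeline: one thread parses, the next transforms, a third writes.
// Each stage is single-threaded and owns its own state; the only sharing
// is the hand-off queue. Stage names and the EOF sentinel are invented.
public final class PipelineSketch {
    public static void main(String[] args) {
        BlockingQueue<String> parsed = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> transformed = new ArrayBlockingQueue<>(1024);

        new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    parsed.put("record-" + i);          // stage 1: produce/parse
                }
                parsed.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        new Thread(() -> {
            try {
                String r;
                while (!"EOF".equals(r = parsed.take())) {
                    transformed.put(r.toUpperCase());   // stage 2: transform
                }
                transformed.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        new Thread(() -> {
            try {
                String r;
                while (!"EOF".equals(r = transformed.take())) {
                    System.out.println(r);              // stage 3: sink
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
    }
}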

When talking about concurrent access to data structures it is very important to separate query from mutation. If data structures are immutable, or support concurrent non-blocking reads, then they can scale very well in parallel and can be reasoned about. Concurrent update to any remotely interesting data structure, let alone a full model, is very, very complex and difficult to manage. Period. Leaving aside the complexity, any concurrent update from multiple writers to a shared model/state is fundamentally limited, as proven by the Universal Scalability Law (USL)[2]. As an industry we are kidding ourselves if we think it is a good idea to have concurrent update from multiple writers to any model when we expect it to scale in our increasingly multi-core world. The good thing is that the majority of application code that needs to be developed consists of queries against models that do not mutate that often.
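
For reference, the USL models relative capacity as C(N) = N / (1 + alpha * (N - 1) + beta * N * (N - 1)), where alpha captures contention and beta captures coherency cost. A tiny illustration with invented coefficients shows the shape of the curve:

// Universal Scalability Law: predicted relative capacity for N writers/cores.
// alpha models contention (serialisation), beta models coherency (crosstalk).
// The coefficients below are invented purely to show the shape of the curve.
public final class UslSketch {
    static double capacity(int n, double alpha, double beta) {
        return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1));
    }

    public static void main(String[] args) {
        double alpha = 0.05, beta = 0.001; // hypothetical coefficients
        for (int n = 1; n <= 64; n *= 2) {
            System.out.printf("N=%-3d predicted speedup=%.2f%n", n, capacity(n, alpha, beta));
        }
    }
}

With any non-zero beta the curve eventually turns retrograde: adding writers reduces throughput, which is exactly the limit on concurrent update from multiple writers.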

A nasty consequence of our industry's desire for concurrent access to shared state is that we also do it synchronously, and that spreads into the distributed context. We need to embrace the world as asynchronous in our approaches to design and algorithms, as a way to avoid latency constraints. By being asynchronous we can be non-blocking and set our applications free to perform better and be more resilient due to enforced isolation. Bandwidth will continue to improve at a great rate, but latency improvements are levelling off.
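
Even in today's Java you can get part of the way there by composing work asynchronously rather than blocking on each step. The service calls below are hypothetical stand-ins for I/O:

import java.util.concurrent.CompletableFuture;

// Sketch of composing work asynchronously instead of blocking on each step.
// loadCustomer/loadOrderCount stand in for remote calls and are invented names.
public final class AsyncSketch {
    public static void main(String[] args) {
        CompletableFuture<String> customer =
            CompletableFuture.supplyAsync(() -> loadCustomer(42));
        CompletableFuture<Integer> orderCount =
            CompletableFuture.supplyAsync(() -> loadOrderCount(42));

        // Neither call blocks the caller; the combination runs when both complete.
        customer.thenCombine(orderCount, (name, count) -> name + " has " + count + " orders")
                .thenAccept(System.out::println)
                .join(); // join only so this demo doesn't exit before printing
    }

    private static String loadCustomer(int id) { return "customer-" + id; }  // stand-in for a remote call
    private static Integer loadOrderCount(int id) { return 3; }              // stand-in for a remote call
}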

My New Year's wishlist for platform providers would be: the infrastructure to enable the development of more cache-friendly, immutable, and append-only data structures; better support for pipeline concurrency; non-blocking APIs (e.g. JDBC); language extensions to make asynchronous programming easier to reason about (e.g. better support for state machines and continuations); and language extensions for writing declarative queries (e.g. LINQ[3] for C#, which could be even better). Oh yes, and don't be so shy about allowing lower-level access from the likes of Java; we are well beyond writing applets that run in browser sandboxes these days!

Martin...

Eric DeFazio

Jan 2, 2015, 9:13:58 AM
to mechanica...@googlegroups.com
Hey Kirk, I dunno if you got a chance to see Guy Steele's talk about parallelism, but I think this would be a huge step in the right direction (if we were to "express" programs differently and allow the language/runtime to run things in parallel).

Anyways check it out:

It's a wonderful talk; he gets to the main points around minute 29:00 or so. Prior to that there's an interesting part where he talks about building an efficient program (but it's "the worst thing he ever wrote").

Cheers
Eric

Camille Fournier

Jan 2, 2015, 10:41:58 AM
to mechanica...@googlegroups.com
"I'm of the camp that believes the advances made in processor performance (i.e. CPU pipe-lining) over the last decade or so has allowed inefficient code to perform "good enough" so that the quality of theaverage program (and the "depth" of skill of the average programmer) has gone down."

I'm pretty sure we were saying that 10 years ago too ;) I would actually argue that most of the CPU gains occurred in the late 90s and early 00s, and the past 10 years have been focused on additional cores/CPUs, memory improvements, flash/SSD of course, and general support for virtualization. We've been living in an "as good as it's gonna get" world for single-threaded, processor-bound performance in many ways for a long time, and yet we keep producing useful, valuable software. We may all be contributing to the heat death of the universe (or the overheating of the planet), but this is what we needed for software to eat the planet.

This comment from the original post made me smile

"And nobody sane would make the cores smaller and weaker in order to fit more of them - the only reason to make them smaller and weaker is because you want to go even further down in power use, so you'd still not have lots of those weak cores."

*waves at Gil* we were all insane once ;)


Kirk Pepperdine

Jan 3, 2015, 5:17:05 AM
to mechanica...@googlegroups.com
I’ve seen Guy talk before but this looks interesting… thanks for the link.

I have to agree with Martin that the biggest high-performance challenge is simply getting data to the CPU. I'm also becoming more of the opinion that parallel streams are cool but of limited use. As you may (or may not) know, my pastime for the last couple of years has been tinkering with Censum, my GC log analysis tooling. I've recently tried moving some of the code to use streams, and even though this seems like it should be the perfect use case for them, I can't use them. I can use them in a pure desktop app, but I don't want to limit Censum to that environment, and every other use case for the query engine simply precludes using streams. Also, this dogma to make everything immutable simply hurts. Most data structures are not naturally immutable, and artificially making them immutable (as is done under the covers with lambdas) comes with a cost. Don't get me wrong, I use immutable data structures, however….

And yeah, we've been saying for more than 20 years: as hardware gets better we find better ways of abusing it… hence my somewhat infamous lunchtime debate with Rich Hickey in Aarhus a while back about Clojure's usefulness in high-performance computing environments. He had another round with Cliff Click, which finally led to some mutable data structures in Clojure. Should I start a flame war by mentioning Scala? It's Jan 3rd and I've not seen a good flame war yet…  ;-)

Regards,
Kirk

Martin Thompson

Jan 3, 2015, 7:44:56 AM
to mechanica...@googlegroups.com
I have to agree with Martin that the biggest high-performance challenge is simply getting data to the CPU. I'm also becoming more of the opinion that parallel streams are cool but of limited use. As you may (or may not) know, my pastime for the last couple of years has been tinkering with Censum, my GC log analysis tooling. I've recently tried moving some of the code to use streams, and even though this seems like it should be the perfect use case for them, I can't use them. I can use them in a pure desktop app, but I don't want to limit Censum to that environment, and every other use case for the query engine simply precludes using streams. Also, this dogma to make everything immutable simply hurts. Most data structures are not naturally immutable, and artificially making them immutable (as is done under the covers with lambdas) comes with a cost. Don't get me wrong, I use immutable data structures, however….

Maybe I'm being slow this Saturday morning, but how do lambdas make data structures immutable under the covers? They are simply functions that can do a restricted capture, as far as I understand. Lambdas in Java capture a reference by value, but what they reference can be mutable and often is.
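
For example (a toy case): the captured reference is effectively final, but the object behind it is as mutable as ever:

import java.util.ArrayList;
import java.util.List;

// A lambda's captured reference is effectively final, but the object it
// points at can still be mutated freely. Toy example for illustration.
public final class LambdaCapture {
    public static void main(String[] args) {
        List<String> names = new ArrayList<>();        // mutable, captured by the lambda
        Runnable addOne = () -> names.add("mutated");  // capture of the reference, not the contents
        addOne.run();
        addOne.run();
        System.out.println(names);                     // prints [mutated, mutated]
    }
}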
 
I've been using immutable data structures for some time and have a love-hate relationship with them. They really do simplify the programming model, and some classes of problem become really elegant: for example, dealing with anything that does a delta comparison of before-and-after, or incremental change. There are lots of use cases for this, like working out what to redraw in a UI or how to model changes in a financial market.

There are some fundamental issues with immutable data structures, as most have trees underpinning them that must be transformed by path copy from leaf to root. This summons the twin performance demons of pointer indirection, and GC as a result of allocation that tends not to fit the weak generational hypothesis. However, these demons can be tamed with control of memory layout and truly concurrent garbage collectors. Gil has got close to giving us the latter and is helping provide a solution to the former.
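
For anyone who has not looked inside one of these, the path copy is roughly the following (an unbalanced, int-keyed toy; real implementations are balanced and/or use wide nodes such as 32-way tries):

// Toy persistent (immutable) binary search tree: insert copies only the nodes
// on the path from root to the insertion point and shares everything else.
// Unbalanced and int-keyed purely to keep the sketch short.
public final class PersistentTree {
    final int key;
    final PersistentTree left;
    final PersistentTree right;

    PersistentTree(int key, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    static PersistentTree insert(PersistentTree node, int key) {
        if (node == null) {
            return new PersistentTree(key, null, null);
        }
        if (key < node.key) {
            // new node on the path, right subtree shared untouched
            return new PersistentTree(node.key, insert(node.left, key), node.right);
        }
        if (key > node.key) {
            // new node on the path, left subtree shared untouched
            return new PersistentTree(node.key, node.left, insert(node.right, key));
        }
        return node; // key already present, the old version is reused as-is
    }
}

Every insert allocates a fresh node per level of the path, which is exactly the allocation and indirection cost described above.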

Imagine how much better an implementation of the B-tree family of data structures can be if you can co-locate the data in each node and get to it by dead reckoning, plus the added benefit of how much more efficient the path copy operation can be when the data is laid out contiguously. Cache oblivious designs can be employed.

There are a lot of benefits to be had if we can address the indirection and GC issues. The resultant tree implementations could then be a much better fit for the efficient parallel query operations that the likes of Guy Steele often talk about, and the likes of Datomic could be implemented so that it becomes a very practical real-world data store.

And yeah, we've been saying for more than 20 years: as hardware gets better we find better ways of abusing it… hence my somewhat infamous lunchtime debate with Rich Hickey in Aarhus a while back about Clojure's usefulness in high-performance computing environments. He had another round with Cliff Click, which finally led to some mutable data structures in Clojure. Should I start a flame war by mentioning Scala? It's Jan 3rd and I've not seen a good flame war yet…  ;-)

Like the quote about the last 20 years :-) 

Some things work better as mutable, other things work better as immutable. My general rule of thumb is that all facts and storage should be considered immutable, and the working set of a model can be mutable but ONLY from a single writer. Cases of multiple writers mutating any data structure should be considered very, very carefully. Private unshared state you can do whatever you like with, provided it stays behind closed doors.
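
In code terms the single-writer rule can be as simple as one thread owning the mutable working set and publishing immutable snapshots to readers. The class and names below are invented for illustration, and copying a full snapshot per update is deliberately naive; persistent structures or double-buffering are the usual refinement:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Single-writer sketch: only the owning thread mutates the working map;
// readers only ever see immutable published snapshots. Names are invented.
public final class SingleWriterModel {
    private final Map<String, Long> workingSet = new HashMap<>(); // touched by one thread only
    private final AtomicReference<Map<String, Long>> published =
        new AtomicReference<>(Collections.<String, Long>emptyMap());

    // Called only from the single writer thread.
    public void apply(String key, long value) {
        workingSet.put(key, value);
        published.set(Collections.unmodifiableMap(new HashMap<>(workingSet))); // publish a snapshot
    }

    // Safe to call from any number of reader threads.
    public Map<String, Long> snapshot() {
        return published.get();
    }
}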

Martin...

Richard Warburton

Jan 3, 2015, 12:00:35 PM
to mechanica...@googlegroups.com
Hi,

Maybe I'm being slow this Saturday morning, but how do lambdas make data structures immutable under the covers? They are simply functions that can do a restricted capture, as far as I understand. Lambdas in Java capture a reference by value, but what they reference can be mutable and often is.

I suspect what was being alluded to here is that the streams API, which was introduced with lambdas, won't modify the data source but returns you a new value computed from it. This introduces a more functional style of programming (since it doesn't have side effects) without adding immutable collections.
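
A trivial example of what I mean; the source list is left untouched:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// The stream pipeline computes a new value and leaves the source alone.
public final class StreamsDoNotMutate {
    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(3, 1, 2);

        List<Integer> doubled = source.stream()
                                      .map(x -> x * 2)
                                      .collect(Collectors.toList());

        System.out.println(source);   // [3, 1, 2] -- unchanged
        System.out.println(doubled);  // [6, 2, 4]
    }
}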

regards,

  Richard Warburton

Martin Thompson

Jan 6, 2015, 7:21:42 AM
to mechanica...@googlegroups.com

Maybe I'm being slow this Saturday morning, but how do lambdas make data structures immutable under the covers? They are simply functions that can do a restricted capture, as far as I understand. Lambdas in Java capture a reference by value, but what they reference can be mutable and often is.

I suspect what was being alluded to here is that the streams API, which was introduced with lambdas, won't modify the data source but returns you a new value computed from it. This introduces a more functional style of programming (since it doesn't have side effects) without adding immutable collections.

Thanks Richard. Doing anything significant in a functional style without efficient immutable collections is likely to be a serious performance bottleneck.