STM and persistent data structures performance on mutli-core archs

Andy C

unread,

Mar 13, 2014, 12:58:54 PM3/13/14

to clo...@googlegroups.com

Hi,

So the other day I came across this presentation: http://www.infoq.com/presentations/top-10-performance-myths

The guy seems to be smart and know what he talks about however when at 0:22:35 he touches on performance (or lack of thereof) of persistent data structures on multi-core machines I feel puzzled.

He seems to have a point but really does not back it with any details. There is also a claim that STM does not cooperate well with GC. Is it true?

Thanks,

Andy

Timothy Baldridge

unread,

Mar 13, 2014, 2:24:29 PM3/13/14

to clo...@googlegroups.com

I talked to Martin after a CodeMesh, and had a wonderful discussion with him about performance from his side of the issue. From "his side" I mean super high performance. You have to get a bit of background on some of his work (at least the work he talks about), much of what he does is high throughput trading systems. In these systems you will often see small messages (50 bytes or so) and the the goal is to process them as quickly as possible.

This means that almost everything is secondary to throughput. So let's take that generalization and apply it to the more common work I or people like me do...here we care mostly about programmer productivity, and making simpler systems. Here immutability and STM like concepts (I'm dumping STM in with atoms here) work pretty well.

At one point in our conversation Martin mentioned that he was able to get about 100,000 messages through his system a second. That's awesome! How often do you have 100k messages in your system?

Everything has tradeoffs. One more example was when Martin explained how he was able to get a 10x performance boost by pinning a JVM to each socket in a server, then pinning that JVM's memory to the RAM sockets closest to that CPU socket. Then he carefully setup shared memory message passing via burning one core on each CPU to monitor the queues, and moving messages only in the directions directly supported by the Hyper Transport bus. Once again, that's amazing...once again I've never needed that to ship a product.

Sadly, other parts of that talk are very true. Many programmers I've talked to don't understand the difference between a hash map and a list, let alone understand when they would use them. So I'll fight that battle, and Martin can fight his, and I'll apply his work to my problems as the solution fits.

So that's the way I view it. I don't claim to know much at all about Martin's work, I just know enough to know that I don't deal with those problems. Most servers I work on are idle at least 50% of the time. And when they do get overloaded I can just startup more machines.

Timothy

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

Andy C

unread,

Mar 13, 2014, 11:59:00 PM3/13/14

to clo...@googlegroups.com

On Thu, Mar 13, 2014 at 11:24 AM, Timothy Baldridge <tbald...@gmail.com> wrote:

I talked to Martin after a CodeMesh, and had a wonderful discussion with him about performance from his side of the issue. From "his side" I mean super high performance.

[...]

Hi Tim,

Thanks for explaining the context of Martin's work - I did not know that it was so advanced. I did some research back in good times on http://en.wikipedia.org/wiki/Connection_Machine (via telnet 56kb/s telnet connection from east Europe to US - that was super cool back then) and can only atest to what he says in the presentation. You can achieve amazing levels of performance provided that patterns of data flow and control in software match exactly parallel nature of given architecture.

However his point is somewhat more serious. He directly "attacks" a premise that using immutable data is inherently mutli-core friendly which comes from the deduction that writing reactive software reduces the use locks and guards hence enables more flow and less waiting. Now his point is that GC acts a super GIL which effectively kills all the hard work done on the language and application design level.

Now, I wish Marin eats his own dog food and as pointed out numerous times in the presentations, he backs up his points with a real experiments and data. At least it was not apparent wether his conclusions were purely theoretical or grounded in some experience.

I am in the process of transitioning from Scala to Clojure ecosystem (just finished SICP videos and have some hard 4clojure problems behind me, but a lot of to learn) so not yet fluent in all aspects of the language but I think at some point I will try to write some simulations of STM performace following some of Martin's intuitions. I think that it will be very cool.

Best,
Andy

Raoul Duke

unread,

Mar 14, 2014, 12:58:23 PM3/14/14

to clo...@googlegroups.com

cough cough erlang cough ahem

Эльдар Габдуллин

unread,

Mar 14, 2014, 1:01:13 PM3/14/14

to clo...@googlegroups.com

He talks about simple things actually.

When you have any sort of immutable data structure and you want to change it from multiple threads

you just must have a mutable reference which points to the current version of that data structure.

Now, updates to that mutable reference are fundamentally serial. Whatever synchronization

strategy you chose been that optimistic updates (atom) or queuing (agent) or locks you inevitably

will have a contention on a large number of threads. When you will run on that you will also

have hundred ways to solve a problem.

There is nothing magical about persistent data structures on multi-core machines :)

четверг, 13 марта 2014 г., 20:58:54 UTC+4 пользователь Andy C написал:

Gary Trakhman

unread,

Mar 14, 2014, 1:01:11 PM3/14/14

to clo...@googlegroups.com

On Fri, Mar 14, 2014 at 12:58 PM, Raoul Duke <rao...@gmail.com> wrote:

cough cough erlang cough ahem

Care to elaborate? :-)

Raoul Duke

unread,

Mar 14, 2014, 1:18:28 PM3/14/14

to clo...@googlegroups.com

>> cough cough erlang cough ahem
> Care to elaborate? :-)

"Now his point is that GC acts a super GIL which effectively kills all
the hard work done on the language and application design level."

erlang's approach to gc / sharing data / multi process / smp is i
think an interesting sweet spot. there sure as heck is not a GIL
effect by the GC.

Gary Trakhman

unread,

Mar 14, 2014, 1:23:02 PM3/14/14

to clo...@googlegroups.com

Shared memory is kind of a degenerate case of message-passing when you think about cache-coherency protocols and such. I'd argue erlang's message-passing isn't inherently superior, but kind of obfuscates the issue.

What's the sequential fraction of an arbitrary erlang program, can you even know (I don't know erlang, so I'm honestly asking)?

Shared memory pretty darn convenient, and we don't have hundred-core+ boxes yet.

Raoul Duke

unread,

Mar 14, 2014, 1:35:14 PM3/14/14

to clo...@googlegroups.com

i am probably out of my depth here, i do not have extensive real-world
experience with the various ways to approach parallelism and
concurrency (to be distinguished of course), more run of the mill
stuff. so if i sound like i'm missing your point or am clueless i ask
for your patience :-)

> What's the sequential fraction of an arbitrary erlang program, can you even
> know (I don't know erlang, so I'm honestly asking)?

who cares? or rather, each person has to only care about their own
program & situation. maybe their stuff fits erlang. maybe it fits
better with something else e.g. LMAX. it. all. depends. :-)

everything depends on context. Martin's talk even included the part
where he bemoaned that people don't just stay single-threaded and fix
their crappy code first. running to concurrency and parallelism is
often a cop-out the way i hear him. that could be seen as arguing
'against' erlang.

so there are places where your program is mostly sequential and things
like "does the GC act like a GIL" do not matter as much as the
situation where you are trying to be more concurrent + parallel but
not distributed. in those sequential situations, maybe erlang becomes
a square peg for the round hole. (although i personally, through
suffering as a maintenance programmer, am a *huge* lover of the
recursive single assignment "turn" based approach to things, and i
love clojure's idea of having a consistent view of the world; most OO
people shoot me in the foot a year after they've left the company,
with their crappy macaroni code.)

> Shared memory pretty darn convenient, and we don't have hundred-core+ boxes
> yet.

i'm confused in that i thought you wrote shared memory ~= message
passing. so why talk about shared memory when that is a lie? just like
Martin said, RAM is a lie. why not realize everything is really
message passing in the long run, model it as such, understand the
issues as such? i do not have a chip on my shoulder about this, i'm
just sounding it out / exploring the thought.

sincerely.

Gary Trakhman

unread,

Mar 14, 2014, 1:46:12 PM3/14/14

to clo...@googlegroups.com

Well, yes it's a lie, it's an abstraction, we love those as programmers :-). The question is, is it worth the trouble and under what conditions?

IMO, we don't understand message passing well enough to be great at programming or characterizing distributed concurrent systems. I wonder how this changes when you're writing erlang. I would guess that certain classes of programs are really easy to write in erlang compared to other languages, and I wonder at what cost?

Clojure appealed to me because I was studying computer architecture in grad school, and I think it has a great answer for cores<100 shared-memory concurrency, but at some point I acknowledge that might not be good enough.

I'm very interested in Netkernel's approach, actually: http://www.1060research.com/products/

It sort of marries the notion of immutable data with distribution and message passing. URLs as pointers.

sincerely.

Gary Trakhman

unread,

Mar 14, 2014, 2:15:07 PM3/14/14

to clo...@googlegroups.com

"Everything has tradeoffs. One more example was when Martin explained how he was able to get a 10x performance boost by pinning a JVM to each socket in a server, then pinning that JVM's memory to the RAM sockets closest to that CPU socket. Then he carefully setup shared memory message passing via burning one core on each CPU to monitor the queues, and moving messages only in the directions directly supported by the Hyper Transport bus. Once again, that's amazing...once again I've never needed that to ship a product. "

This is absolutely fascinating to me. The ability to do this sort of thing is a tradeoff from using abstractions (threads, OS threads, Java threads) that closely match or can be massaged to match or 'have sympathy' for the hardware realities. I think this can get lost when we stray too far.

Raoul Duke

unread,

Mar 14, 2014, 2:20:07 PM3/14/14

to clo...@googlegroups.com

> that closely match or can be massaged to match or 'have sympathy' for the
> hardware realities. I think this can get lost when we stray too far.

i wish this were somehow more modeled, composed, and controllable up
in our ides and source code, rather than being esoteric tweaky options
in the various inscrutable layers. i want a futuristic JIT that does
this kind of stuff for me :-) :-)

Gary Trakhman

unread,

Mar 14, 2014, 2:22:42 PM3/14/14

to clo...@googlegroups.com

Yes I agree, that was my next thought. I wish we could do these knobs in a single JVM.

guns

unread,

Mar 14, 2014, 2:39:16 PM3/14/14

to clo...@googlegroups.com

On Fri 14 Mar 2014 at 02:15:07PM -0400, Gary Trakhman wrote:

> "Everything has tradeoffs. One more example was when Martin explained how
> he was able to get a 10x performance boost by pinning a JVM to each socket
> in a server, then pinning that JVM's memory to the RAM sockets closest to
> that CPU socket. Then he carefully setup shared memory message passing via
> burning one core on each CPU to monitor the queues, and moving messages
> only in the directions directly supported by the Hyper Transport bus. Once
> again, that's amazing...once again I've never needed that to ship a
> product. "
>
> This is absolutely fascinating to me. The ability to do this sort of thing
> is a tradeoff from using abstractions (threads, OS threads, Java threads)
> that closely match or can be massaged to match or 'have sympathy' for the
> hardware realities. I think this can get lost when we stray too far.

cf. the C10M project:

http://c10m.robertgraham.com/p/blog-page.html

tl;dr
The kernel gets in the way of extreme hi-throughput, so if that's
what you want, don't lean on the kernel to do your packet routing,
thread scheduling, etc.

guns

Andy C

unread,

Mar 14, 2014, 8:59:45 PM3/14/14

to clo...@googlegroups.com

Maybe one day this idea http://en.wikipedia.org/wiki/Lisp_machine will come back, I mean in a new form ..

In any case, I think that it would be great to see some performance benchmarks for STM

A.

François Rey

unread,

Mar 15, 2014, 4:19:45 AM3/15/14

to clo...@googlegroups.com

On 15/03/14 01:59, Andy C wrote:

Maybe one day this idea http://en.wikipedia.org/wiki/Lisp_machine will come back, I mean in a new form ..

That reminds me of this Urbit project, which is nowhere near usefulness at present, but deserves to be mentioned as a (radical) experiment striving to go back to the roots (ideally hardware) in order to break free from the inconsistencies of having grown separately the mechanical and semantic models. FoNC is another example.

François Rey

unread,

Mar 15, 2014, 8:57:45 AM3/15/14

to clo...@googlegroups.com

Martin's point about immutable and persistent data structures is further developed in his interview on infoq, you can skim to point #9 if you're in a hurry.
Overall what he says is that in terms of scalability of the development activity, immutability and persistence are great ideas, since we don't have to deal with non-deterministic behaviour any more. When one needs to scale the running system, meaning increasing the rate at which the persistent data structure is updated, these can lead to performance issues in various ways:
- longer GC pauses because persistency increases the number of objects that are neither very short-lived nor long-lived,
- contention because the root of the tree of the persistent data structure becomes the focal point of concurrency,
- increased CPU cache misses since persistent data structures are trees that increasingly span larger non-contiguous and non-sequential parts of memory
Of these the last point is probably the most painful, since there's no way to deal with it unless one reconsiders the whole persistent data structure.
In other words increasing the number of threads and cores may eventually lower throughput because the time taken for dealing with these issues (GC pauses, locking, cache misses) grows larger than the time taken for useful computation.
I can't backup any of this with actual data and experience. However I think this old thread about poor performance on multicore does provide a clear picture of the problem, which becomes even clearer with actual stats showing CPUs were 83% idle, i.e. waiting for memory.

Also one should view Martin's other vidoes on infoq to get a better understanding of his arguments. He's actually quite positive about Clojure in general. It's just that depending on the scalability and performance requirements, persistent data structures may not provide a satisfactory answer and could even lower throughput.

Softaddicts

unread,

Mar 16, 2014, 1:24:29 PM3/16/14

to clo...@googlegroups.com

I agree with most of his arguments but not all of them.

Memory subsystems have always been a major concern.
Since 35 years ago, many designs went through simulation before
burning anything on chip. Especially SMP designs with shared memory
given the cost of prototyping.

Simulations used "typical" workloads to identify which strategies should
be implemented in hardware. Shared memory sub-systems proved to be the
main bottleneck every time.

Any choices of strategies could result in bottlenecks if the payload differed
somewhat from the "average" simulated ones.
Threads appeared much more later and this alone makes you wonder about
the pressure increase on shared memory sub-systems.

The numbers Martin claims as speed increases (myth #x,
CPUs are not getting faster) are simply better/worse hardware strategy choices
which decreased idle time.
Manufacturers are doing the same job as decade ago, simulating and finding
how to optimize as much as possible typical workloads.

We cannot break the physical barriers anymore, we need a technology
upgrade (atom level, quantum level, whatever). But such a change is
no more elegant than a brute force approach that will reshape
performance and create a new barrier farther away.

Having tuned generated machine code when domestic appliance sized
computers were the norm, I would not like to get back to this age of
analyzing the hardware to twist my software accordingly.

Relying on hardware architectures to craft software is not to me a path
with a bright future. Server hardware changes often enough that such
an approach would mess design and be unsustainable in the long run.

We may end up someday with "tunable" hardware that adapt to
workloads. By tunable I mean giving the hardware instructions on how to
behave. This would be better than the above. We had such options with
traditional compilers.

I think that significant optimizations have to be decided at a higher level.
I doubt that any of that can be implemented at the hardware level alone
and let it decide on the fly. This sounds like magic, too good to be true.

I am also quite convinced that optimizing in hardware single threaded
and muti-threaded processing are antagonist goals given the current
hardware designs.

Martin hits a number of significant nails, we need to be
aware of hardware limitations and we need to measure the impacts of
our choices and change these within some reachable goals.

However achieving 10% or less cpu idle time on a specific server architecture
to me is not a goal.

I'm interested by the constraints I have to met (business and physical ones)
and playing within a large playground to meet these wether it involves
using more powerful/specially designed hardware or using better
software designs.

Luc P.

> Martin's point about immutable and persistent data structures is further
> developed in his interview on infoq

> <http://www.infoq.com/interviews/reactive-system-design-martin-thompson>, you

> can skim to point #9 if you're in a hurry.
> Overall what he says is that in terms of scalability of the development
> activity, immutability and persistence are great ideas, since we don't
> have to deal with non-deterministic behaviour any more. When one needs
> to scale the running system, meaning increasing the rate at which the
> persistent data structure is updated, these can lead to performance
> issues in various ways:
> - longer GC pauses because persistency increases the number of objects
> that are neither very short-lived nor long-lived,
> - contention because the root of the tree of the persistent data
> structure becomes the focal point of concurrency,
> - increased CPU cache misses since persistent data structures are trees
> that increasingly span larger non-contiguous and non-sequential parts of
> memory
> Of these the last point is probably the most painful, since there's no
> way to deal with it unless one reconsiders the whole persistent data
> structure.
> In other words increasing the number of threads and cores may eventually
> lower throughput because the time taken for dealing with these issues
> (GC pauses, locking, cache misses) grows larger than the time taken for
> useful computation.
> I can't backup any of this with actual data and experience. However I
> think this old thread about poor performance on multicore

> <https://groups.google.com/forum/#%21topic/clojure/48W2eff3caU> does

> provide a clear picture of the problem, which becomes even clearer with
> actual stats showing

> <https://groups.google.com/d/msg/clojure/48W2eff3caU/FBFQp2vrWFgJ>CPUs
> <https://groups.google.com/d/msg/clojure/48W2eff3caU/FBFQp2vrWFgJ>were
> 83% idle
> <https://groups.google.com/d/msg/clojure/48W2eff3caU/FBFQp2vrWFgJ>, i.e.

> waiting for memory.
>
> Also one should view Martin's other vidoes on infoq

> <http://www.infoq.com/author/Martin-Thompson> to get a better

> understanding of his arguments. He's actually quite positive about
> Clojure in general. It's just that depending on the scalability and
> performance requirements, persistent data structures may not provide a
> satisfactory answer and could even lower throughput.
>
>
> On 14/03/14 18:01, ?????? ????????? wrote:
> > He talks about simple things actually.
> >
> > When you have any sort of immutable data structure and you want to
> > change it from multiple threads
> > you just must have a mutable reference which points to the current
> > version of that data structure.
> > Now, updates to that mutable reference are fundamentally serial.
> > Whatever synchronization
> > strategy you chose been that optimistic updates (atom) or queuing
> > (agent) or locks you inevitably
> > will have a contention on a large number of threads. When you will run
> > on that you will also
> > have hundred ways to solve a problem.
> >
> > There is nothing magical about persistent data structures on
> > multi-core machines :)
> >

> > ???????, 13 ????? 2014 ?., 20:58:54 UTC+4 ???????????? Andy C ???????:

> >
> > Hi,
> >
> > So the other day I came across this
> > presentation:http://www.infoq.com/presentations/top-10-performance-myths
> > <http://www.infoq.com/presentations/top-10-performance-myths>
> >
> > The guy seems to be smart and know what he talks about however
> > when at 0:22:35 he touches on performance (or lack of thereof) of
> > persistent data structures on multi-core machines I feel puzzled.
> >
> > He seems to have a point but really does not back it with any
> > details. There is also a claim that STM does not cooperate well
> > with GC. Is it true?
> >
> > Thanks,
> > Andy
> >
>

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

--
Softaddicts<lprefo...@softaddicts.ca> sent by ibisMail from my ipad!

François Rey

unread,

Mar 17, 2014, 12:06:04 PM3/17/14

to clo...@googlegroups.com

On 16/03/14 18:24, Softaddicts wrote:

I think that significant optimizations have to be decided at a higher level.
I doubt that any of that can be implemented at the hardware level alone
and let it decide on the fly. This sounds like magic, too good to be true.

I am also quite convinced that optimizing in hardware single threaded 
and muti-threaded processing are antagonist goals given the current
hardware designs.

Martin hits a number of significant nails, we need to be
aware of hardware limitations and we need to measure the impacts of
our choices and change these within some reachable goals.

However achieving 10% or less cpu idle time on a specific server architecture
to me is not a goal.

I'm interested by the constraints I have to met (business and physical ones)
and playing within a large playground to meet these wether it involves
using more powerful/specially designed hardware or using better
software designs.

I hear you well. Whether further progress comes from hardware and/or software doesn't matter, I just hope progress is being made somewhere and that we won't have to fiddle with hundreds of parameters as it's the case for the JVM right now. Until then, tedious experimentation will be the only way to ensure scaling does happen.
Sure Martin's case at LMAX was extreme, not many people will have such concerns. But we're in an age where the need for scalability can be very sudden (e.g. popular phone app.). With servers nowadays that can have many CPUs and cores, and many Gb of RAM, one can easily hope that vertical scaling can be sufficient, but it's often not as simple as that. Vertical can even be more dangerous than horizontal, since in the horizontal case you probably architect for scaling right from start. So what works and doesn't work is valuable information, and initiatives such as the Reactive Manifesto are helpful because they provide guidelines.

Luc Prefontaine

unread,

Mar 17, 2014, 12:33:51 PM3/17/14

to clo...@googlegroups.com

I would say that guidelines can help but
curiosity is mandatory.

I used to sleep with books on my
bedside table about hardware design,
operating system internals,
io subsystems, networking, ...

I did this for 15 years after I started
to work with computers.

Aside from work related stuff of course,
this was to increase my general
knowledge and to be more relaxed
on the job :)

Knowing your enemy is essential...

Luc P.

> such as the Reactive Manifesto <http://www.reactivemanifesto.org/> are

> helpful because they provide guidelines.
>

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
--

Luc Prefontaine<lprefo...@softaddicts.ca> sent by ibisMail!

Martin Thompson

unread,

Mar 17, 2014, 2:13:59 PM3/17/14

to clo...@googlegroups.com

It is so nice to see a good rational discussion on this subject. Let me try and clarify the points people have picked up on.

Upfront let me say I'm a fan of functional techniques in general and spend a good part of my life educating the OO folk with blinkers that there are many ways to skin a cat. Many of these techniques, not only improve code clarity and help me reason about it, they can also improve performance compared to alternatives.

To the main point, I often hear people say that "immutable data will solve the parallel issue". To explain my findings from working with persistent data structures in the performance space it need to be considered from 2 perspectives.

First perspective is that of code complexity which impacts developer productivity and maintenance. Concurrent access to a persistent data structure is much easier to reason about and not subject to timing issues and race conditions as a mutating data structures often are. This is a great thing and helps with concurrent programming immensely. I've also found persistent data structures also offer features not possible with other approaches, such as the ability to "stop time" and even step backwards and forwards as events change the world.

From the second perspective of the performance characteristics of persistent data structures I've not been so impressed. As has been mentioned elsewhere in this thread I've discovered a number of consequences of these approaches.

Due to the path-copy semantics, the contention gets driven to the root of the tree. This becomes a serialisation point around which Amdahl's law hunts you down with no chance of escape. Mutable data structures can often avoid this by distributing the contention over leaf nodes. This problem becomes compounded with more threads and increasing depth of the model. To my mind this does not solve the "parallel problem" as the performance gets worse with increased concurrency. This often results in parallel slowdown. How people talk about this also highlights to me how so many developers fundamentally misunderstand the difference between parallelism and concurrency.
Persistent data structures rely heavily on allocation and GC to support path-copy and keeping immutable nodes around. In my experience large parts of these trees tend to not fit with the weak generational hypothesis on which most modern garbage collectors are designed. This can require significant GC tuning to reach even modest performance levels. Objects that make it to the survivor spaces, or worse the old generation, cost significantly more to manage.
The fact that most persistent data structures require trees and indirection - aka pointer chasing - for implementation goes against the advances in caches, prefetchers, and branch predication of modern processors.

In my personal experience I cannot get within 10X the throughput, or latency, of mutable data models when using persistent data models. I've tried really hard on occasions because I've wanted the persistent semantics. However I did find some things that helped a lot. One such approach was to have a single mutating thread for the data model, all others are readers. Datomic does this on a larger scale. Updates from multiple threads can be queued to the mutator thread. This does not address the GC issues but it does address the CAS failure issues on the root node. The GC issues can be greatly relieved, i.e. no really nasty old generation compaction pauses, by using concurrently compacting JVM such as Azul Zing.

Now with this all been said. Many many applications can achieve perfectly adequate performance using persistent data structures while enjoying the other benefits they bring. Good clean composable design comes first and often will get you all the performance you need. When people screw up performance it is often really silly stuff like bad algorithms and senseless network round trips. If you get all the basics right and testing shows you still have a performance issue, then profiling can sometimes show this can be traced to persistent data structures causing too much root node contention or stressing the garbage collector.

When I talk about single threaded apps what I mean is all data is owned by a single thread. There can be many threads in a process. The key is they do not share state. The best way to achieve this is via message passing. Erlang is a good example of this, I only wish they took the same principles more to heart in the design of their own runtime ;-)

Martin...

Andy C

unread,

Mar 17, 2014, 4:22:31 PM3/17/14

to clo...@googlegroups.com

From what I understand, a single core can easily saturate a memory bus. At the same time L2 and L3 caches are so small as compared to GB of memory systems yet growing them does not necessarily help either due to larger latencies. It all limits the number of practical applications which could really take a full advantage of multi core architectures.

Having said that I just wonder if a typical async Clojure application can even be efficiently performed on a multi-core architecture. In any case, immutability helps writing reliable multi "execution context" software there is no need for explicit synchronization in most of cases.

BTW, I absolutely love the "async" name itself chosen by Clojure authors as opposed to "reactive" as in http://www.reactivemanifesto.org. Reactive comes with bad connotations and really does not reflect the intentions well.

Best,
Andy

Luc Prefontaine

unread,

Mar 17, 2014, 6:38:57 PM3/17/14

to clo...@googlegroups.com

In the "old" days, the principle of locality was prevalent.
Most data allocation was static
or if not contiguous (the stack).
Code was also contiguous being
most of the time statically compiled
on several pages.

Memory was costly, its size remained
"reasonable".

Today, data is scattered
everywhere in a huge memory,
its location changes frequently,
code can be dynamically compiled,
get loaded by small chunks, ....

Hardware wise they do what they
can to deal with this but I think
it's a lost battle.

We need a quantum jump at the
hardware level to get some
faster random access to memory.

This will probably shake the software
stack a lot. Which brings other issues...

Meanwhile we are stuck with
finding the least painful path to get
our stuff to run asap.

Luc P.

Andy C

unread,

Mar 17, 2014, 11:49:49 PM3/17/14

to clo...@googlegroups.com

Today, data is scattered
everywhere in a huge memory,
its location changes frequently,
code can be dynamically compiled,
get loaded by small chunks, ....

It describes my case actually - I am loading about 2-8GB of "stuff" in the memory and to tons of mapping, grouping by and filtering. That works reasonably well now but will not scale up soon.

A.

Andy C

unread,

Mar 17, 2014, 11:59:40 PM3/17/14

to clo...@googlegroups.com

In my personal experience I cannot get within 10X the throughput, or latency, of mutable data models when using persistent data models.

Hi Martin,

Thanks for finding this thread :-). Let me ask a reversed question. Given you come from a persistent data model where code remains reasonably simple. How much effort it really takes to make an imperative model working well with relatively low number of defects? How deep you go to make sure that data structures fit CPU architecture in terms of topology as well as of size of caches? And how it scales in terms of writing the code itself (I mean are code alternations are easy straightforward or you have to write it from scratch)?

I do not mean to troll, just sincere curiosity ...

Andy

Martin Thompson

unread,

Mar 18, 2014, 6:28:51 AM3/18/14

to clo...@googlegroups.com

As a co-author of the reactive manifesto I'd like to point out that "reactive" can be considered a superset of "async". Good reactive applications are event driven and non-blocking. They are also responsive, resilient, and scalable which async can help with but does not prescribe.

What are the "bad connotations"? I'm curious to understand other perspectives on this.

Martin Thompson

unread,

Mar 18, 2014, 7:03:24 AM3/18/14

to clo...@googlegroups.com

In my personal experience I cannot get within 10X the throughput, or latency, of mutable data models when using persistent data models.

Hi Martin,
Thanks for finding this thread :-). Let me ask a reversed question. Given you come from a persistent data model where code remains reasonably simple. How much effort it really takes to make an imperative model working well with relatively low number of defects? How deep you go to make sure that data structures fit CPU architecture in terms of topology as well as of size of caches? And how it scales in terms of writing the code itself (I mean are code alternations are easy straightforward or you have to write it from scratch)?

I've never heard of "imperative model". I'm aware of imperative programming. Can you expand on what you mean?

I think in the context of this thread 2 points are being conflated here. We started discussing the point about how "immutable data (actually persistent data structures) solve the parallel problem". Parallel in the context of increasing thread count as a result of core counts increasing in CPUs, but actually concurrent in access and mutation of data structures in the micro and domain models in the macro. This is being conflated with the performance of data structures in general.

I believe that concurrent access to data structures should not be the default design approach. The greatest complexity in any system often comes from concurrent access to state, be the persistent or not. It is better to have private data structures within processing contexts (threads, processes, services, etc.), which communicate via messages. Within these private processing contexts concurrent access is not an issue and thus non-concurrent data structures can be employed. With non-concurrent access it is easy to employ rich data structures like basic arrays, open-addressing hash maps, B+ and B* trees, bitsets, bloom filters, etc., without which many applications would be unusable due to performance and memory constraints.

When working in the context of single threaded access to data structures I see using a blend of functional, set theory, OO, and imperative programming techniques as the best way to go. Right tool for the job. I see the effort levels as very similar when choosing the appropriate technique that fits a problem. I have many times seen a code mess and pain result from inappropriate techniques applied blindly like religion to problems that just are not a good fit.

So to more directly answer you question. If I need to have concurrent access to a shared data structures I prefer it to be persistent from a usability perspective given performance constraints of my application at satisfied. If I need greater performance I find non-persistent data structures can give better performance for a marginal increase in complexity. I've never quantified it but it feels like percentage points rather than factors. I frame this in the context that the large step in complexity here is choosing to have concurrent access to the data. It is the elephant in the room. Those who are not good at concurrent programming should just not be in this space in the first place because if they do it will get ugly either way. I'd recommend reading the findings of the "double hump" paper from Brunel University on peoples natural ability in programming.

http://blog.codinghorror.com/separating-programming-sheep-from-non-programming-goats/

Hope this helps clarify.

Martin...

Andy C

unread,

Mar 18, 2014, 12:33:35 PM3/18/14

to clo...@googlegroups.com

As a co-author of the reactive manifesto I'd like to point out that "reactive" can be considered a superset of "async". Good reactive applications are event driven and non-blocking. They are also responsive, resilient, and scalable which async can help with but does not prescribe.

What are the "bad connotations"? I'm curious to understand other perspectives on this.

I would say the opposite. Asynchronous or not >happening, moving, or existing at the same time< does not necessarily imply non-blocking. It simply emphasizes that "things" might coexist without an explicit dependency (in a time dimension) on each other. At the same time (pun not intended) reactive limits itself to responding to events which implies a control by some sort of FSM. Perhaps concurrency could be modeled using FSMs, but I do not believe it is always a simple transition.

Andy

Gary Trakhman

unread,

Mar 18, 2014, 12:41:19 PM3/18/14

to clo...@googlegroups.com

Martin,

You recommend message-passing approaches like Erlang's as generally superior, but I'm curious if there's any more specific thoughts to the relative tradeoffs from shared-memory by default vs message-passing, ie, where you might rely on hardware-level copies (cache coherence) for coordination in the first, but you rely on software-level data copies in the second for coordination.

Is it just a more conservative approach to allocate resources by having to go through more manual work to coordinate processes, or is there more to it? I could see that having development cost (time and effort) proportional to the amount of coordination might result in systems that are likely to avoid these bottlenecks where possible. In the case of shared-memory, there is an inverse relationship, ie it takes more work to decouple systems and remove extra synchronization than it does to build them badly.

Andy C

unread,

Mar 18, 2014, 12:45:45 PM3/18/14

to clo...@googlegroups.com

I've never heard of "imperative model". I'm aware of imperative programming. Can you expand on what you mean?

I meant "mutable data model". Sorry for mixing up terms.

http://blog.codinghorror.com/separating-programming-sheep-from-non-programming-goats/

Hope this helps clarify.

It does. I absolutely agree and get the point that "persistent data structure + multiple cores/threads" approach does solve much here, in fact probably makes things worse. And indeed it is better to queue "Updates from multiple threads" and get it handled by "the mutator thread". But at the same time I hate to have my imagination be "limited" by the hardware designs. That really hurts :-).

Best regards,

Andy

Gary Trakhman

unread,

Mar 18, 2014, 12:52:18 PM3/18/14

to clo...@googlegroups.com

If it's any consolation, queues or delays are used in hardware to overcome limitations in hardware, too :-).

Specifically, I'm thinking of anti-jitter stuff in audio circuits.

Martin Thompson

unread,

Mar 18, 2014, 1:03:58 PM3/18/14

to clo...@googlegroups.com

As a co-author of the reactive manifesto I'd like to point out that "reactive" can be considered a superset of "async". Good reactive applications are event driven and non-blocking. They are also responsive, resilient, and scalable which async can help with but does not prescribe.

What are the "bad connotations"? I'm curious to understand other perspectives on this.

I would say the opposite. Asynchronous or not >happening, moving, or existing at the same time< does not necessarily imply non-blocking. It simply emphasizes that "things" might coexist without an explicit dependency (in a time dimension) on each other. At the same time (pun not intended) reactive limits itself to responding to events which implies a control by some sort of FSM. Perhaps concurrency could be modeled using FSMs, but I do not believe it is always a simple transition.

Our use of language in the technology industry could, for sure, be better. Take simple examples like RAM where random should be "arbitrary", or don't get me started on people who misuse the term "agnostic" ;-)

Synchronous is now understood to be making a call in a blocking fashion with an optional return value. What is actually missing is the following implied word, i.e. communication. We are normally talking about synchronous communication or asynchronous communication but the "communication" part gets dropped.

Good traits/qualities in critical systems are to be responsive, resilient, and scalable which is at the heart of the reactive manifesto. Many techniques can be employed to achieve this. One such technique that has yet to be bettered in my experience is to be event driven and non-blocking. This is very much how the real world functions. We all have our own private models of the world that we share with message passing, or events given your chosen nomenclature.

As an aside I believe FSMs are a very useful tool that more developers should have in their toolbox. If they are observable in implementation then even better. They have been totally invaluable whenever I've done any work on highly-available systems.

Laurent PETIT

unread,

Mar 18, 2014, 1:18:50 PM3/18/14

to clo...@googlegroups.com

Reminds me the (abandoned?) work by Rich on Pods ...

Raoul Duke

unread,

Mar 18, 2014, 2:06:36 PM3/18/14

to clo...@googlegroups.com

> some sort of FSM. Perhaps concurrency could be modeled using FSMs, but I do
> not believe it is always a simple transition.

http://www.cis.upenn.edu/~stevez/papers/LZ06b.pdf

:-)

François Rey

unread,

Mar 18, 2014, 5:23:46 PM3/18/14

to clo...@googlegroups.com

On 18/03/14 18:03, Martin Thompson wrote:

Our use of language in the technology industry could, for sure, be better. Take simple examples like RAM where random should be "arbitrary", or don't get me started on people who misuse the term "agnostic" ;-)

I would even say our use of abstractions in the tech industry could be better, most notably because of the law of leaky abstractions: how abstractions such as RAM, GC, persistent data structure, etc. always have a limit at which the things they're supposed to abstract come back right in your face and you end up needing to be aware of the abstracted things. I think mechanical sympathy is all about dealing with leaky abstractions. The art of knowing and playing with the limits of our abstractions is what Rich Hickey would call knowing the trade-offs.

I don't think the law of leaky abstraction has any formal underpinning, but using the concept of trade-off it's about what we gain from using an abstraction, which we often know very well because that's the whole point of using it, and what we loose, which is often the forgotten and darker side. In those terms an abstraction without leakiness is one that does not loose anything, whether in terms of information or capability. Know of any? Isn't the whole point of an abstraction to be hiding certain underlying aspects, therefore loosing seems to be part of the deal, meaning abstractions are leaky by nature? I think they are, but don't ask me for a proof.

The thing is that our industry is based on layers upon layers of abstractions, whether at the physical level (integrated circuits, interfaces, etc.) or at the software level: binary (1GL) abstracted into assembly (2GL), then C language (3GL), etc. Virtual machines is now another level of abstraction which is probably here to stay, at least in terms of flexibility offered. So perhaps a key question is how many levels of abstraction can we afford before the whole thing becomes too difficult to manage. If think we even have situations where leakiness at one level makes another level even more leaky, at which point one needs to reconsider the whole stack of abstractions, perhaps from hardware up to programming language and runtime, and check for the soundness of the trade-offs made at each level, and then provide more than one choice.

So in the end it's all about knowing the trade-off we make at each level of the stack we use. I don't think many of us do, this is becoming more and more obvious as the stack becomes higher and higher. We don't have to be a mechanic to drive a car. But it helps to know what we can expect from it, so we can better choose other modes of transportation when for example we need to travel 1000 km or miles in 4 hours, or travel with 10+ people, or travel where there's no road. End-users of computer programs don't have to know the mechanics. But programmers do if they want to excel at their trade. I don't think we can avoid this: space and time are inescapable dimensions of our material reality and there will always be trade-offs around them.

Raoul Duke

unread,

Mar 18, 2014, 5:29:25 PM3/18/14

to clo...@googlegroups.com

> The thing is that our industry is based on layers upon layers of
> abstractions, whether at the physical level (integrated circuits,
> interfaces, etc.) or at the software level: binary (1GL) abstracted into
> assembly (2GL), then C language (3GL), etc. Virtual machines is now another

you maybe forgot microcode, too? ;-) ;-)

yes, there are a lot of abstractions! i day-dream that it would be
super keen if there were a way for us to have explicit models of them,
including their leaks! and if those models where composable. and then
if our tools could use those models to show us warnings and errors; to
generate stress tests in a guided-randomized quickcheck style; etc.
etc. etc.

Andy C

unread,

Mar 18, 2014, 11:44:42 PM3/18/14

to clo...@googlegroups.com

I like FSMs, but they do not compose well.

A.

Raoul Duke

unread,

Mar 19, 2014, 2:14:55 PM3/19/14

to clo...@googlegroups.com

> I like FSMs, but they do not compose well.

some have argued for generative grammars that generate the fsm,
because it is generally easier to compose grammars, and then generate
the final fsm. iiuc.

John Mastro

unread,

Mar 19, 2014, 5:02:52 PM3/19/14

to clo...@googlegroups.com

Due to the path-copy semantics, the contention gets driven to the root of the tree.

Out of curiosity, would reference counting (rather than or in addition to "normal" GC) help with this? Or is reference counting problematic in a highly concurrent environment? It seems like reference cycles will be less of an issue with immutable data.

- John

Martin Thompson

unread,

Mar 19, 2014, 5:26:22 PM3/19/14

to clo...@googlegroups.com

Due to the path-copy semantics, the contention gets driven to the root of the tree.

Out of curiosity, would reference counting (rather than or in addition to "normal" GC) help with this? Or is reference counting problematic in a highly concurrent environment? It seems like reference cycles will be less of an issue with immutable data.

Reference counting tends to be less efficient than a tracing collector in a concurrent environment. Reference counts must employ atomic updates that can be relatively expensive, e.g. similar to how AtomicLong.incrementAndGet() works. I've blogged on how atomic increments work.

http://mechanical-sympathy.blogspot.co.uk/2011/09/adventures-with-atomiclong.html

Andy C

unread,

Mar 19, 2014, 11:03:10 PM3/19/14

to clo...@googlegroups.com

So, the following test puzzles me. Not because it takes virtually the same time (I know that Fork/Join is not cheap and memory is probably the biggest bottleneck here). But because I do not get why map (as opposed to r/ma) uses all 8 cores on my MacBookPro. All of them seem to be running according to Activity Monitor at more less the same level.

user=> (def l (into [] (range 60000000)))
#'user/l
user=> (time (def a (doall (map #(Math/sin (* % %)) l))))
"Elapsed time: 19986.18 msecs"

user=> (time (def a (doall (into [] (r/map #( Math/sin (* % %)) l)))))
"Elapsed time: 18980.583 msecs"

Andy C

unread,

Mar 19, 2014, 11:09:54 PM3/19/14

to clo...@googlegroups.com

I thought about it too but composing FSMs out of some patterns really transcends my mind unfortunately. In another words, I have hard time to see patterns in FSMs - they tend to be quite complex in the first place. Sometimes a single transition might ruin the perception and completely the properties of everything.

There is a very good book treating about subject: http://www.amazon.com/Practical-Statecharts-Quantum-Programming-Embedded-ebook/dp/B0017UASZO/ref=sr_1_3?s=books&ie=UTF8&qid=1395284883&sr=1-3&keywords=C%2B%2B+state+machine

A.

François Rey

unread,

Mar 20, 2014, 6:34:33 AM3/20/14

to clo...@googlegroups.com

I would also expect this code to run on a single CPU at any one time, however process/thread scheduling can make this thread run on different cores at different times. Depending on the sampling method the activity monitor may display another picture of what you would expect. On my linux machine, the cpu history graph shows it's using mostly one cpu at any one time, but it's not always the same cpu.
I think the JVM uses all cores by default, and there's no standard way to specify thread affinity.

László Török

unread,

Mar 20, 2014, 7:26:08 AM3/20/14

to clo...@googlegroups.com

"into" uses "reduce" under the hood, you probably need "fold" to have the computation to run on FJ:

(def a (time (r/foldcat (r/map #( Math/sin (* % %)) l)))))

this runs more than 2x as fast on my macbook pro

Las

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

László Török

Checkout Lollyrewards.com - Giving credit, getting credit.

Follow us on Twitter @lollyrewards

Stay up to date on the latest offers by liking our Facebook page.

François Rey

unread,

Mar 20, 2014, 8:21:38 AM3/20/14

to clo...@googlegroups.com

On 20/03/14 12:26, László Török wrote:

"into" uses "reduce" under the hood, you probably need "fold" to have the computation to run on FJ:

Yes, now I see all my CPUs being maxed-out.

francois@laptop:~$ perf stat java -cp ~/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main -e "(require '[clojure.core.reducers :as r]) (let [l (into [] (range 10000000))] (time (r/fold + (r/map #(Math/sin (* % %)) l))))"
"Elapsed time: 1594.89737 msecs"
678.1151045769922

Performance counter stats for 'java -cp /home/francois/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main -e (require '[clojure.core.reducers :as r]) (let [l (into [] (range 10000000))] (time (r/fold + (r/map #(Math/sin (* % %)) l))))':

      23460.971496 task-clock                #    4.728 CPUs utilized
             7,663 context-switches          #    0.000 M/sec
               570 CPU-migrations            #    0.000 M/sec
           174,586 page-faults               #    0.007 M/sec
    68,361,248,389 cycles                    #    2.914 GHz                     [83.33%]
    44,379,949,020 stalled-cycles-frontend   #   64.92% frontend cycles idle    [83.21%]
    22,520,594,887 stalled-cycles-backend    #   32.94% backend cycles idle    [66.75%]
    63,715,541,342 instructions              #    0.93 insns per cycle
                                             #    0.70 stalled cycles per insn [83.39%]
    11,104,063,057 branches                  # 473.299 M/sec                   [83.25%]
       155,172,776 branch-misses             #    1.40% of all branches         [83.46%]

       4.962656410 seconds time elapsed

francois@laptop:~$ perf stat java -cp ~/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main -e "(require '[clojure.core.reducers :as r]) (let [l (into [] (range 10000000))] (time (reduce + (map #(Math/sin (* % %)) l))))"
"Elapsed time: 5694.224824 msecs"
678.1151045768991

Performance counter stats for 'java -cp /home/francois/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main -e (require '[clojure.core.reducers :as r]) (let [l (into [] (range 10000000))] (time (reduce + (map #(Math/sin (* % %)) l))))':

      21365.047020 task-clock                #    2.324 CPUs utilized
             9,443 context-switches          #    0.000 M/sec
               375 CPU-migrations            #    0.000 M/sec
           164,960 page-faults               #    0.008 M/sec
    64,028,578,416 cycles                    #    2.997 GHz                     [83.21%]
    39,016,283,425 stalled-cycles-frontend   #   60.94% frontend cycles idle    [83.26%]
    21,038,585,232 stalled-cycles-backend    #   32.86% backend cycles idle    [66.77%]
    63,857,091,972 instructions              #    1.00 insns per cycle
                                             #    0.61 stalled cycles per insn [83.29%]
    11,161,960,110 branches                  # 522.440 M/sec                   [83.48%]
       138,372,839 branch-misses             #    1.24% of all branches         [83.28%]

       9.195138860 seconds time elapsed

francois@laptop:~$

Andy C

unread,

Mar 30, 2014, 1:40:24 AM3/30/14

to clo...@googlegroups.com

Hi,

So this is a follow-up. I claimed that 1 CPU core can saturate the memory but it turns out I was wrong, at least to some extend. Driven by curiosity I decided to do some measurements and test my somewhat older MBP 2.2GHz Inter Core i7. While it obviously all depends on the hardware, I thought it could be still a good test.

In order to rule out the GC and JVM out of equation I went back to old good C and wrote a simple program which accesses a 40MB chunk of memory in both linear and random manner. All tests run a few times to ensure proper warm up and allocations within OS, however saw a great deal of consistency. It is not scientific by any means, but gives a rough idea what we are dealing with. Here are results where numbers are normalized gains.

+----------------+-----------+------------+
| # of processes | random   | linear    |
+----------------+-----------+------------+
|        1       |   1.00    |   1.00     |
+----------------+-----------+------------+
|        2       |   1.97    |   1.76     |
+----------------+-----------+------------+
|        4       |   3.51    |   1.83     |
+----------------+-----------+------------+
|        8       |   4.24    |   1.86     |
+----------------+-----------+------------+

The conclusion is that in practice two cores can easily saturate memory buses. Accessing it in certain patters helps to some extend. Although 8 cores is pretty much all what makes sense unless you do tons of in cache stuff.

Best,
Andy

François Rey

unread,

Mar 30, 2014, 4:39:10 AM3/30/14

to clo...@googlegroups.com

On 30/03/14 07:40, Andy C wrote:

Here are results where numbers are normalized gains.

+----------------+-----------+------------+
| # of processes | random   | linear    |
+----------------+-----------+------------+
|        1       |   1.00    |   1.00     |
+----------------+-----------+------------+
|        2       |   1.97    |   1.76     |
+----------------+-----------+------------+
|        4       |   3.51    |   1.83     |
+----------------+-----------+------------+
|        8       |   4.24    |   1.86     |
+----------------+-----------+------------+

This is great stuff.
Let me make sure I read it correctly.
Having 2 processes makes a value 1.97 times higher than with 1 core in the random case, and 1.76 times higher in the linear case, but what is that value being measured?
Some form of throughput I suppose and not time, right?

The conclusion is that in practice two cores can easily saturate memory buses. Accessing it in certain patters helps to some extend. Although 8 cores is pretty much all what makes sense unless you do tons of in cache stuff.

Indeed. It also means single threaded linear access isn't going to be very much faster if you add more threads.
BTW, are you sure the threads were running in parallel on separate cores and not just concurrently on a smaller number of cores?
As you said, this should be dependent on hardware and running this on actual server machine would be as interesting.

Martin Thompson

unread,

Mar 30, 2014, 5:39:23 AM3/30/14

to clo...@googlegroups.com

Memory access patterns make a huge a difference to memory throughput. I've explored this in some detail in the following blog.

http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

Andy C

unread,

Mar 31, 2014, 9:32:56 AM3/31/14

to clo...@googlegroups.com

This is a slightly different result as this time I measure elapsed time (see appendix and excuse not so nice code) as opposed to a clock time. Results are similar (unless you have more processes than cores). I am planning to release the code to github soon.

+----------+--------+-------+
| # of     | seq   | rdm   |
|processes |        |       |
+----------+--------+-------+
|1         | 1.00   | 1.00 |
+----------+--------+-------+
|2         | 1.96   | 1.75 |
+----------+--------+-------+
|4         | 3.20   | 1.83 |
+----------+--------+-------+
|8         | 3.78   | 1.83 |
+----------+--------+-------+
|16        | 3.61   | 1.81 |
+----------+--------+-------+
|32        | 3.56   | 1.81 |
+----------+--------+-------+

This is great stuff.
Let me make sure I read it correctly.
Having 2 processes makes a value 1.97 times higher than with 1 core in the random case, and 1.76 times higher in the linear case, but what is that value being measured?
Some form of throughput I suppose and not time, right?

I think you could call that a normalized throughput. Here are more details. First column is the number of separate O/S test processes running in background concurrently, started by the shell script virtually at the same time. Then I collected the output which simply logs how long it takes to iterate thru 40MB of memory in sequential or random manner. Second and third column are number_of_processes/elapsed_time*elapsed_time_from_first_row_process for sequential and random access respectively. In ideal conditions, elapsed_time should be constant as we use more and more cores/CPUs.

Indeed. It also means single threaded linear access isn't going to be very much faster if you add more threads.
BTW, are you sure the threads were running in parallel on separate cores and not just concurrently on a smaller number of cores?
As you said, this should be dependent on hardware and running this on actual server machine would be as interesting.

I wanted to see the worse case, separate processes and memory , which was simplest to implement. Yes, I am sure that cores were utilized by OS as the number of processed added. I watched MBP Activity Monitor and CPU History which was hitting 100%. Also I did not optimize the output (should not matter).

One interesting thing, I heard somewhere, that O/Ss bounce long lasting CPU intensive threads between cores in order to equalize the heat generated from the silicon. I did not observe that but longest running test took about 15 sec using a single core.

Thx,
Andy

Appendix 1:

        {
            int n;
            int j;
            struct timeval begin,end;
            gettimeofday (&begin, NULL);
            for(n=0;n<100;n++)
                for (j=0;j<len;j++)
                    i_array[rand()%len]=j;
            gettimeofday (&end, NULL);
            printf("[:rdm %s %d %d %f]\n",des,len,m,tdiff(&begin,&end));
        }
        {
            int n;
            int j;
            struct timeval begin,end;
            gettimeofday (&begin, NULL);
            for(n=0;n<100;n++)
                for (j=0;j<len;j++)
                    i_array[j]=j;
            gettimeofday (&end, NULL);
            printf("[:seq %s %d %d %f]\n",des,len,m,tdiff(&begin,&end));
        }

Andy C

unread,

Mar 31, 2014, 9:37:29 AM3/31/14

to clo...@googlegroups.com

Memory access patterns make a huge a difference to memory throughput. I've explored this in some detail in the following blog.

http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

Thanks for sharing. From the Clojure perspective and using reducers, it seems we are hitting the worse case. Not sure if it worth effort of using it (at least in my applications).

Thx,
Andy

Justin Smith

unread,

Apr 10, 2014, 3:33:17 PM4/10/14

to clo...@googlegroups.com

One could quantify this with Category Theory. In order to map from one domain to another reversibly, the target domain must have at least as many elements as the source domain, if not more. An abstraction which simplifies by reducing the number of elements in play is guaranteed to be "leaky".

Of course there could be abstractions that work not by reducing elements but organizing them so they map more efficiently to human thinking. Information theory also has tools that help us quantify when some relation or structure is lost in a translation from a domain to its abstraction.

Reply all

Reply to author

Forward