the real latency performance killer


Robert Engels

unread,
Apr 17, 2014, 12:46:26 PM4/17/14
to mechanica...@googlegroups.com
I was referred to this group by a colleague, and the participants are certainly more knowledgeable than myself, but I'd like to throw out my two cents anyway.

As a recent blog post of mine showed, Java easily outperforms C++ in real-world tests, but even that test is flawed...

The problem is that even though this is a real-world test, doing "real" work, it is still essentially a micro-benchmark. Why?

Which brings me to the heart of the problem... Here are the memory access times on a typical modern processor:

Core i7 Xeon 5500 Series data-source latency (approximate):
L1 CACHE hit: ~4 cycles
L2 CACHE hit: ~10 cycles
L3 CACHE hit, line unshared: ~40 cycles
L3 CACHE hit, shared line in another core: ~65 cycles
L3 CACHE hit, modified in another core: ~75 cycles
Remote L3 CACHE: ~100-300 cycles
Local DRAM: ~60 ns
Remote DRAM: ~100 ns


So with my "real-world" test, the heart of the code path is always in the level 1 cache, thanks to predictive loading of the cache when the message object is retrieved.

Now compare this with a true real-world application with gigabytes of heap. Most modern processors have about 20 MB of shared level 3 cache, which is a fraction of the memory in use, so when the garbage collector is moving things around, and/or background "house-keeping" tasks are doing their work, they are blowing out the CPU caches (even the non-shared L2 cache is destroyed by a compacting garbage collector). Even isolated CPUs don't help with the latter.

So when your low-frequency but low-latency code runs (say, sending an order in response to some market event), it is going to run 5x slower (or more if NUMA is involved) than in the micro-benchmark case, due to non-cached main-memory access.
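A rough way to see the effect for yourself is to pointer-chase through working sets of different sizes. This is only a sketch (the class name, sizes and iteration counts are arbitrary, and a serious measurement would use a harness like JMH), but the gap between a cache-resident set and one far beyond L3 is hard to miss:

import java.util.Random;

public class WorkingSetLatency {

    // Each load depends on the previous one, so the latency cannot be hidden.
    static int chase(int[] next, long steps) {
        int i = 0;
        for (long s = 0; s < steps; s++) {
            i = next[i];
        }
        return i; // returned so the JIT cannot discard the loop
    }

    public static void main(String[] args) {
        final long steps = 20_000_000L;
        // ~128 KB of ints fits in L2; ~256 MB is far beyond any L3.
        for (int ints : new int[] { 32 * 1024, 64 * 1024 * 1024 }) {
            int[] next = new int[ints];
            for (int i = 0; i < ints; i++) next[i] = i;
            // Sattolo's algorithm builds one random cycle over all slots,
            // so the hardware prefetcher cannot predict the walk.
            Random rnd = new Random(42);
            for (int i = ints - 1; i > 0; i--) {
                int j = rnd.nextInt(i);
                int t = next[i]; next[i] = next[j]; next[j] = t;
            }
            chase(next, steps); // warm-up
            long t0 = System.nanoTime();
            int sink = chase(next, steps);
            double ns = (System.nanoTime() - t0) / (double) steps;
            System.out.printf("%,d ints: %.2f ns per access (sink=%d)%n", ints, ns, sink);
        }
    }
}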

How do we fix this? Two ways.

With CPU support for "non-cached reads and writes", a thread (or possibly a class/object) could be marked as "background", so that memory accesses by that thread/class do not go through the cache, hopefully preserving the L2/L3 cache for the "important" threads.

Similarly, an object/class marked "important" would be a hint to the garbage collector not to move that object around if at all possible. This can sort of be solved now with off-heap memory structures, but they're a pain (at least in their current incarnation).
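For what it's worth, here is a minimal sketch of that off-heap workaround. The record layout and field names are invented, but ByteBuffer.allocateDirect really does place the bytes outside the Java heap, where a compacting collector will not relocate them:

import java.nio.ByteBuffer;

public final class OffHeapOrders {
    private static final int PRICE_OFFSET = 0;   // long  (8 bytes)
    private static final int QTY_OFFSET   = 8;   // int   (4 bytes)
    private static final int SIDE_OFFSET  = 12;  // byte  (1 byte, padded to 16)
    private static final int RECORD_SIZE  = 16;

    private final ByteBuffer buf;

    OffHeapOrders(int maxOrders) {
        // Direct memory lives outside the Java heap; the GC never moves it.
        buf = ByteBuffer.allocateDirect(maxOrders * RECORD_SIZE);
    }

    void put(int index, long price, int qty, byte side) {
        int base = index * RECORD_SIZE;
        buf.putLong(base + PRICE_OFFSET, price);
        buf.putInt(base + QTY_OFFSET, qty);
        buf.put(base + SIDE_OFFSET, side);
    }

    long price(int index) { return buf.getLong(index * RECORD_SIZE + PRICE_OFFSET); }
    int  qty(int index)   { return buf.getInt(index * RECORD_SIZE + QTY_OFFSET); }
    byte side(int index)  { return buf.get(index * RECORD_SIZE + SIDE_OFFSET); }
}

The pain is exactly what you'd expect: every field becomes a manual offset calculation instead of a plain Java field.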

Without something similar to the above, I just don't think low-frequency and low-latency is possible.




Robert Engels

unread,
Apr 17, 2014, 12:59:16 PM4/17/14
to mechanica...@googlegroups.com
Also, this paper has some really good research on the problem.

Ross Bencina

unread,
Apr 17, 2014, 1:38:59 PM4/17/14
to mechanica...@googlegroups.com
On 18/04/2014 2:46 AM, Robert Engels wrote:
> How do we fix this? 2 ways.
>
> With CPU support for "non-cached reads and writes", a thread or
> (possibly a class/object) can be marked as "background", and then memory
> access by this thread/class do not go through the cache, hopefully
> preserving the L2/L3 cache for the "important" threads.
>
> Similarly, an object/class marked "important" is a clue to the garbage
> collector to not move this object around if at all possible. This can
> sort of be solved now with off-heap memory structures, but they're are
> pain (at least in the current incarnation).
>
> Without something similar to the above, I just don't think low-frequency
> and low-latency is possible.

Another trick to add to your bag is to partition your memory layout
based on cache associativity sets. These guys got some performance
improvement in their real-time memory allocator:

http://www.cister.isep.ipp.pt/ecrts11/prog/CAMAaPredictableCacheAwareMemoryAllocator.pdf

Key quote:

"Store descriptors only in memory locations mapped to a known,
bounded range of cache sets!"
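A back-of-the-envelope sketch of the arithmetic behind that quote (the cache geometry below is made up, and real L3 indexing uses hashed physical addresses, so treat this only as an illustration of the idea, not of the paper's allocator):

public class CacheSetIndex {
    static final int LINE_SIZE = 64;    // bytes per cache line (typical)
    static final int NUM_SETS  = 8192;  // e.g. an 8 MB, 16-way cache: 8 MB / (64 * 16)

    // Which set a given address maps to under this simple (non-hashed) indexing.
    static int setOf(long address) {
        return (int) ((address / LINE_SIZE) % NUM_SETS);
    }

    public static void main(String[] args) {
        long descriptorBase = 0x7f00_0000L;  // hypothetical region
        System.out.println("descriptor lands in set " + setOf(descriptorBase));
        // An allocator following the paper only places descriptors at addresses
        // whose setOf(...) falls inside a small, known band of sets, so descriptor
        // traffic can only ever evict lines from that band.
    }
}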

Ross.

Vitaly Davidovich

unread,
Apr 17, 2014, 1:41:14 PM4/17/14
to mechanica...@googlegroups.com

There are only a few cases where Java will beat C++.  Now, as you state in your blog, it's easier and faster to write a better-performing Java version of some algorithm, but there are many more optimization opportunities exposed in C++ than in Java.  So a skilled C++ developer who is sympathetic to the machine will most likely outpace the Java version.  Not to speak of the amount of memory the two servers will take.  Also, Java is commonly plenty fast, true - the issue most people fight against in low latency is the unpredictability of GC.  So both groups end up trying to avoid allocations: the Java guys because of the GC, the C++ guys because of malloc/new.

The cache issues you mention, and generally the discrepancy between core and memory speed, are the primary reason Java needs facilities to shrink its footprint and stop chasing pointers.
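As a rough sketch of what "shrink footprint and stop chasing pointers" can look like with today's Java: parallel primitive arrays instead of an array of objects. The Trade-like fields here are invented for illustration:

public final class TradeColumns {
    private final long[] prices;
    private final int[]  quantities;
    private final long[] timestamps;
    private int size;

    TradeColumns(int capacity) {
        prices = new long[capacity];
        quantities = new int[capacity];
        timestamps = new long[capacity];
    }

    void add(long price, int qty, long timestamp) {
        prices[size] = price;
        quantities[size] = qty;
        timestamps[size] = timestamp;
        size++;
    }

    /** A scan walks memory sequentially: no object headers, no reference hops. */
    long notional() {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += prices[i] * quantities[i];
        }
        return sum;
    }
}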

Sent from my phone


Robert Engels

unread,
Apr 17, 2014, 2:00:03 PM4/17/14
to mechanica...@googlegroups.com
I disagree somewhat, but it really depends on the use case. In an enterprise-type application with GBs of live data, without some macro-level support for cache usage, you're going to have a problem.

This is why really low-latency systems (even in Java) use separate processes in order to better isolate cache usage. But even then, unless your complete data set fits in the unshared L2 cache, you're going to have a problem, as the level 3 cache is going to be blown by other "enterprise" processes running on the other cores. You can isolate these processes onto different machines and pay the network penalty, or you need to do not just core isolation but CPU isolation, and that gets expensive real quick...

Vitaly Davidovich

unread,
Apr 17, 2014, 2:12:34 PM4/17/14
to mechanica...@googlegroups.com

I don't see how separate processes help to isolate cache usage.  What some folks do is flat out divvy up the machine: interrupts are masked out to run on a subset of cores, some processes are then affinitized to run on other cores, etc.  Maybe that's what you meant, and it is a headache to maintain these setups.

Not sure which part you disagree with, but I'm sure we all agree that java could use a diet for data representation.  Irrespective of other things, it'd be nice if more stuff fit in cache to begin with before we start worrying about cache misses and the like.

Sent from my phone


Martin Thompson

unread,
Apr 17, 2014, 2:15:13 PM4/17/14
to mechanica...@googlegroups.com
When it comes to memory access performance, four major things matter:
  1. Volume of data you are shifting, but this is becoming less of an issue with every generation as bandwidth keeps taking huge strides forward.
  2. Locality: if you are in the same cache line or page then you benefit from warm data caches, TLB caches, and DDR sense-amplifier row buffers.
  3. Predictable access patterns mean the prefetchers can hide the latency by prefetching the data in time for your instructions needing it. Pointer chasing is bad.
  4. Non-uniform memory access (NUMA) effects. When crossing interconnects between sockets you need to add 20ns for each one-way hop, and depending on your CPU version you may not get prefetch support and may be subject to unexpected writebacks. You need to get used to the likes of numactl and cgroups to ensure your processes run and access memory where you expect.
If your code shows no sympathy to the memory subsystems then you can pay a big performance price - the sketch below illustrates points 2 and 3.
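A tiny illustration of points 2 and 3: the same sum over the same data, traversed in memory order versus striding across rows. The sizes are arbitrary and a real comparison belongs in a proper harness, but the shape of the result is robust:

public class TraversalOrder {
    static final int N = 4096;
    static final int[][] table = new int[N][N];

    static long rowMajor() {          // walks memory sequentially; prefetcher-friendly
        long sum = 0;
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                sum += table[r][c];
        return sum;
    }

    static long columnMajor() {       // strides across rows (plus a row-object hop per access in Java)
        long sum = 0;
        for (int c = 0; c < N; c++)
            for (int r = 0; r < N; r++)
                sum += table[r][c];
        return sum;
    }

    public static void main(String[] args) {
        rowMajor(); columnMajor();    // warm-up
        long t0 = System.nanoTime();
        long a = rowMajor();
        long t1 = System.nanoTime();
        long b = columnMajor();
        long t2 = System.nanoTime();
        System.out.printf("row-major %d ms, column-major %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}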



Robert Engels

unread,
Apr 17, 2014, 2:23:34 PM4/17/14
to mechanica...@googlegroups.com
It's mainly for Java, but it can apply to large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other, smaller "important" tasks.

For instance, the "order engine" could be an isolated JVM with a very small heap, doing little garbage generation (or at least confining garbage to a small portion of the heap). You are effectively constraining its heap usage to fit within the L2 of the isolated core, so even if the garbage collector compacts the heap, it still resides in L2.
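A minimal sketch of that "small heap, little garbage" style (the Order fields and ring size are invented): preallocate everything at startup and reuse it, so the steady state allocates nothing and there is very little for the collector to move.

public final class OrderRing {
    static final class Order {
        long price;
        int quantity;
        byte side;
    }

    private final Order[] ring;
    private long sequence;

    OrderRing(int capacity) {                 // capacity must be a power of two for the mask below
        ring = new Order[capacity];
        for (int i = 0; i < capacity; i++) {
            ring[i] = new Order();            // all allocation happens up front
        }
    }

    /** Claim the next pre-allocated slot; nothing is allocated on the hot path. */
    Order claim() {
        return ring[(int) (sequence++ & (ring.length - 1))];
    }
}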

I disagree with the general statement that you can write faster code in C than in Java once you add other real-world constraints like time to market, architecture flexibility, correctness, etc.


Robert Engels

unread,
Apr 17, 2014, 2:27:48 PM4/17/14
to mechanica...@googlegroups.com
Agreed. I also think the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manual layout of shared structures, cache lines, etc. by the developer).

People knock the higher-level abstractions in Java, and continually want lower, more direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher-level abstractions leads to better performance, once they let go of their egos.

Robert Engels

unread,
Apr 17, 2014, 2:35:19 PM4/17/14
to mechanica...@googlegroups.com
Also, if the hardware engineers could figure out a way to make main memory access as fast as L1 cache, all these problems go away...



Vitaly Davidovich

unread,
Apr 17, 2014, 2:40:38 PM4/17/14
to mechanica...@googlegroups.com

It's not that people think they're smarter than machines (although that's true as well - people are the ones designing them; machines are just much quicker than humans).  The issue is that as you go lower in the stack, things become more and more generalized.  As a developer of some system, you typically know a lot more about your data and its flow than anything lower than you.  In those cases, you want more control because, well, you happen to know more about your use case.  The machine can pick up some patterns automatically (e.g. branch prediction, prefetch, etc.), but they're going to be general, "obvious" patterns.

The idea of having the JVM do dynamic layout based on some CPU feedback has been brought up before, but this is a hard problem.  What happens if the workload changes? Are you going to re-layout everything? Leave it be? How is this data going to be collected, and for which memory accesses? What is the perf implication, CPU + memory? What happens if the profile collected is not indicative of the most optimal layout? This is already an issue with JIT compilation, as its dynamic nature is a blessing and a curse.

It's nice to have default tuning done for you, but for high perf scenarios, there needs to be manual control exposed.

Sent from my phone


Kirk Pepperdine

unread,
Apr 17, 2014, 2:53:34 PM4/17/14
to mechanica...@googlegroups.com

On Apr 17, 2014, at 8:23 PM, Robert Engels <ren...@ix.netcom.com> wrote:

> It's mainly for Java, but can apply large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other smaller "important" tasks.

I don't see how this will help. Threads within a process or in different processes will have the same overall effect on the cores and associated cache. Threads and processes are abstractions, ways of organizing things for humans. By the time they hit a core…

Regards,
Kirk

Vitaly Davidovich

unread,
Apr 17, 2014, 2:59:51 PM4/17/14
to mechanica...@googlegroups.com
Fit into L2? L2 is like 512 KB - 1 MB.

The time-to-market argument in favor of Java vs C++ is only relevant, I think, if you're constantly starting from scratch.  Architecture flexibility is really an artifact of the engineers working on the project; I've seen horror shows in Java as well - this is not a language issue.  Correctness - hmm, a bit hard to say.  There's certainly a class of errors you won't see in Java, but I don't know if that's what you mean by correctness.  The actual business-logic correctness is, again, in the hands of the developers on the project.  C++ is a tricky language as a whole, but one doesn't have to use the entire language.  Also, compiler support and static analysis tools are getting better there as well.

Having said all that, I'm a huge fan of the JVM; I think it's an excellent piece of engineering, and given all that it provides, the speed code can run at is pretty impressive.  And in a lot of cases, it's fast *enough*.  However, in domains where ultimate speed/efficiency == $$ (whether via direct means or indirect, such as requiring fewer machines), it can pay to squeeze as much as possible out of a machine.



Robert Engels

unread,
Apr 17, 2014, 3:02:55 PM4/17/14
to mechanica...@googlegroups.com
But as you start making those decisions lower and lower, your software becomes very rigid, and unable to respond to changes in architecture, usage, etc. So when you factor in all of the other real-world concerns, I just don't buy that you're going to consistently out-perform a generalized, easy-to-change system (where there are hundreds of developers continually improving the generalized internals).

As far as the machines only being faster, not smarter, I'm not sure I buy that either. Take a 30 variable multiple regression on a large data set. Yes, the human designed the system, but he could never solve it without the machine... if you're too slow to matter, you might as well be dumb too... (and then there is all of the machine learning, and genetic algorithms stuff which is a whole other topic ...)


Vitaly Davidovich

unread,
Apr 17, 2014, 3:03:17 PM4/17/14
to mechanica...@googlegroups.com
I think Robert was implying affinitizing the process(es) to run on only certain (non-overlapping) cpus; that's the "then isolating them" part.  At least that's how I understood it, in which case, there are cases where such a scenario helps.  With JVM processes, this is somewhat of an issue because now each of these processes incurs the same JVM overhead repeatedly, thus reducing the machine's capacity.



Robert Engels

unread,
Apr 17, 2014, 3:07:20 PM4/17/14
to mechanica...@googlegroups.com
Not true. If you isolate the core and run the smaller JVM with a small memory footprint on a single core (and nothing else on that core), then you have the L1 and L2 isolated from all other activity, and any compaction still leaves the objects in the L2 cache.





Martin Thompson

unread,
Apr 17, 2014, 3:11:50 PM4/17/14
to mechanica...@googlegroups.com
This is a pipe dream. If you have studied modern hardware you'd realise the hierarchy will only get deeper, and that we are moving towards core-local and tiled memory.

Robert Engels

unread,
Apr 17, 2014, 3:21:53 PM4/17/14
to mechanica...@googlegroups.com
I am certainly not a hardware guru by any means, but I recall people thinking 14-nanometer CPUs were a pipe dream too... and now we're talking 10 nm...

Martin Thompson

unread,
Apr 17, 2014, 3:36:01 PM4/17/14
to mechanica...@googlegroups.com
I talk to hardware folk, and cache hierarchies are getting deeper; innovation is looking at memory local to the CPUs rather than huge shared memories.

We can always be surprised but there is nothing in the pipeline that suggests we are going to get large memory spaces at the 3-4 cycle response times of L1 caches.





Robert Engels

unread,
Apr 17, 2014, 3:42:50 PM4/17/14
to mechanica...@googlegroups.com
I agree that that is the likely direction (my original comment was intended as a joke), but that makes even more of a case for higher-level abstractions to take advantage of huge (> 1024 core) machines with larger local caches.

With higher abstractions it becomes much easier to break processes apart and transparently integrate them when needed (sometimes with no developer effort - everything is RMI, etc.), and to let the OS/JVM figure out what to run where.

Trying to do this manually with massively parallel machines is very difficult.

Martin Thompson

unread,
Apr 17, 2014, 3:49:04 PM4/17/14
to mechanica...@googlegroups.com
I think some of the really interesting work on high-level abstractions in this area is on "Cache Oblivious Algorithms".

Here is a nice blog on potential speedup.
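A classic example in that spirit is recursive matrix transpose: no cache-size parameter appears anywhere, yet the recursion eventually produces sub-blocks that fit whatever caches the machine happens to have. The leaf cutoff of 32 below is an arbitrary choice, not a tuned constant:

public class CacheObliviousTranspose {
    // Transposes the (rows x cols) sub-block of an n x n matrix starting at (row, col).
    static void transpose(double[] src, double[] dst, int n,
                          int row, int col, int rows, int cols) {
        if (rows <= 32 && cols <= 32) {
            for (int r = row; r < row + rows; r++)
                for (int c = col; c < col + cols; c++)
                    dst[c * n + r] = src[r * n + c];
        } else if (rows >= cols) {            // split the longer dimension
            transpose(src, dst, n, row, col, rows / 2, cols);
            transpose(src, dst, n, row + rows / 2, col, rows - rows / 2, cols);
        } else {
            transpose(src, dst, n, row, col, rows, cols / 2);
            transpose(src, dst, n, row, col + cols / 2, rows, cols - cols / 2);
        }
    }

    public static void main(String[] args) {
        int n = 2048;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < a.length; i++) a[i] = i;
        transpose(a, b, n, 0, 0, n, n);
        System.out.println(b[1] == a[n]);     // spot check: b[0][1] == a[1][0]
    }
}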

Vitaly Davidovich

unread,
Apr 17, 2014, 3:49:30 PM4/17/14
to mechanica...@googlegroups.com

There's already plenty of abstraction, through all layers, hardware to software.  You basically want something where, given any random app X, the runtime will special-case it in a nearly optimal way, automatically and on the fly; and do it in the most performant way; and do it every time.  That's not going to happen - it'll get you 80%; we need control over the other 20 (or even less, but there remains a need for manual control over the small yet important/hot percentage of the codebase).

Sent from my phone

Robert Engels

unread,
Apr 17, 2014, 3:58:54 PM4/17/14
to mechanica...@googlegroups.com
I disagree. I'll let someone else hand-tune (develop) for the exact configuration, hardware, software, etc. I'll write the code in a more generic manner, and I'll take advantage of every hardware generation's improved capability far faster than they will - they will always be behind in terms of performance because of this... (That being said, some configuration is needed today, as we haven't gotten that far along...)

Today, I can move our application to an IBM POWER7 series machine with 5 GHz processors and double the performance - all without changing a single line of code, not even a recompile... Even if the machine costs a million dollars, how many developer-years are saved?


Robert Engels

unread,
Apr 17, 2014, 3:59:21 PM4/17/14
to mechanica...@googlegroups.com
Interesting, thanks !

Vitaly Davidovich

unread,
Apr 17, 2014, 4:01:57 PM4/17/14
to mechanica...@googlegroups.com
ok :)

As an aside, I hope you realize that clock speed alone has stopped being a principal performance factor for a few generations of processors at this point.

Robert Engels

unread,
Apr 17, 2014, 5:54:00 PM4/17/14
to mechanica...@googlegroups.com
Btw, just saw this on the POWER7 Wikipedia page...

One feature that IBM and DARPA collaborated on is modifying the addressing and page table hardware to support global shared memory space for POWER7 clusters. This enables research scientists to program a cluster as if it were a single system, without using message passing. From a productivity standpoint, this is essential since some scientists are not conversant with MPI or other parallel programming techniques used in clusters.[5]

Gil Tene

unread,
Apr 17, 2014, 8:46:45 PM4/17/14
to mechanica...@googlegroups.com, Robert Engels
Unfortunately, It's Not Not true. ;-)

If you can isolate your small (or large) process (or set of threads) on a separate socket, you are protected from having your cache interfered with by anything not accessing its contents.

But when you isolate a single core within a modern Xeon socket, your L1 and L2 are not isolated, and your noisy in-socket neighbors will still hurt you. The L3 on Xeons is inclusive of L2 and L1. When an LRU L3 line is evicted to make room for a newly read one, associated L2 and/or L1 contents go away with it.

So unless you dedicate an entire socket to your isolated process, your next best bet is to avoid going idle, keeping your L1 and L2 warm by having your "idle loop" repeatedly access all the stuff you may need, even when you don't need it. This won't prevent neighbor-driven eviction, but it gives you a much higher likelihood of pre-recovering from it before you actually miss in the L1/L2 when you care about it.
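Roughly the shape such a loop can take - the queue type and the hotState array below are stand-ins for whatever the real application actually keeps hot:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public final class WarmSpinLoop implements Runnable {
    private final Queue<Runnable> events = new ConcurrentLinkedQueue<>();
    private final long[] hotState = new long[32 * 1024];  // ~256 KB stand-in for the hot working set
    private volatile boolean running = true;
    private long sink;                                     // keeps the warming reads observable

    void submit(Runnable event) { events.add(event); }
    void stop()                 { running = false; }

    @Override
    public void run() {
        while (running) {
            Runnable event = events.poll();
            if (event != null) {
                event.run();          // the latency-critical work
            } else {
                touchHotState();      // "idle": re-warm L1/L2 instead of parking
            }
        }
    }

    private void touchHotState() {
        long s = 0;
        for (int i = 0; i < hotState.length; i += 8) {     // one read per 64-byte line
            s += hotState[i];
        }
        sink = s;
    }
}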

BTW, within Java VMs, you can separate the VM threads (mostly the GC threads) to run on a separate socket, which will keep them from thrashing your cache when they do their background work. This is actually fairly practical even with multiple JVMs and per-core isolation, since you can put all the JVMs GC threads on the "system" socket and keep all your application threads (in all JVMs) in the dedicated socket. I know people who actually do this...

Separately, if you do the "keep my cached stuff warm by accessing or modifying it all the time" thing on your per-thread isolated CPUs, even compaction/relocation of your objects by the GC doesn't hurt much, as your relocated objects will be pre-recovered back into L1/L2 just like they would be if a neighbor process caused eviction through mere L3 pressure.


robert engels

unread,
Apr 17, 2014, 9:45:36 PM4/17/14
to Gil Tene, mechanica...@googlegroups.com
You are very correct... my bad.

We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around which destroys the cache anyway...

Keeping needed code and cache hot (in an idle loop sense) is not always possible... (or often not easy to do without very ugly code).

To clarify on your first point though: if the other cores all have working sets within their L2 cache size (or, less restrictively, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?

But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.

Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per-core L2 based on the configured loads. Do you know of any architectures using such a setup?

Kirk Pepperdine

unread,
Apr 18, 2014, 2:46:35 AM4/18/14
to mechanica...@googlegroups.com
Even with this level of isolation... at some level you'll be sharing, and once you share you'll have to deal with contention.

Gil Tene

unread,
Apr 18, 2014, 2:55:15 AM4/18/14
to

Follow up answers inline.

On Thursday, April 17, 2014 6:45:36 PM UTC-7, Robert Engels wrote:
> You are very correct... my bad.
>
> We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around which destroys the cache anyway...
>
> Keeping needed code and cache hot (in an idle loop sense) is not always possible... (or often not easy to do without very ugly code).

Yes. It's ugly. But not keeping it warm guarantees it won't last long in the presence of other activity in the same socket.

Luckily, both spatial and temporal locality are alive and well in most applications, and hardware prefetchers are really good at dealing with multi-line access patterns, so missing this stuff back into the L1 is not that big a deal (sub-usec hits to get back to being warm, usually). But that's just as true when the GC kicked your objects out or moved them around...
 

> To clarify on your first point though: if the other cores all have working sets within their L2 cache size (or, less restrictively, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?

Nothing practical (that interacts with the outside world) lives purely within its L2 cache size. At the very least, any network i/o you are doing will be moving in and out of the L3 cache. After ~20MB of network traffic, or any other memory traffic, all your idle (not actively being hit) L2 and L1 contents will have been thrown away, and will generate new L3 cache misses. So if your isolated core is mostly idle or spinning (which is usually the case), and it does not actively access the contents of its L2, any other activity in the socket will cause that cold L2 to get thrown away.
 

> But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.

Non-cached read/writes are usually used for specialized i/o operations... They are so expensive (compared to cached ones) that I highly doubt this will be used for anything real (like "everything this thread does is non-cached"). Remember that non-cached also mean non-streaming and non-prefetcheable. It also means that each word or byte access is a separate ~200 cycle memory access. Also remember that stack memory is generally indistinguishable from local memory, and that the CPU has a limited set of registers... That all adds up to "threads/classes that are forced to use non-cacheable memory for everything are not useful for much of anything"
 
> Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per-core L2 based on the configured loads. Do you know of any architectures using such a setup?

There is a good reason for the lowest-level cache (aka "LLC"; the one closest to memory) being inclusive in pretty much all multi-socket architectures. When the LLC is inclusive of all closer-to-the-cores caches, coherency traffic is only needed between LLCs, and only for coherency state transitions at the LLC level. The hit and miss rates (in ops/sec, not in %) in LLCs are orders of magnitude smaller than those in L1 (and L2 where it exists). If the LLC were not inclusive, state changes in the inner caches would need to be communicated to all other caches, and the cross-socket coherency traffic volume would grow by a couple of orders of magnitude, which simply isn't practical with chip-to-chip interconnects and pin counts.

Martin Thompson

unread,
Apr 18, 2014, 3:17:06 AM4/18/14
to mechanica...@googlegroups.com
On 18 April 2014 07:52, Gil Tene <g...@azulsystems.com> wrote:
>> Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per-core L2 based on the configured loads. Do you know of any architectures using such a setup?
>
> There is a good reason for the lowest-level cache (aka "LLC"; the one closest to memory) being inclusive in pretty much all multi-socket architectures. When the LLC is inclusive of all closer-to-the-cores caches, coherency traffic is only needed between LLCs, and only for coherency state transitions at the LLC level. The hit and miss rates (in ops/sec, not in %) in LLCs are orders of magnitude smaller than those in L1 (and L2 where it exists). If the LLC were not inclusive, state changes in the inner caches would need to be communicated to all other caches, and the cross-socket coherency traffic volume would grow by a couple of orders of magnitude, which simply isn't practical with chip-to-chip interconnects and pin counts.

Gil, I'm not sure you are correct here. LLC on Linux normally refers to Last Level Cache, not "lowest". AMD's L3 cache is a mostly exclusive victim buffer: Intel are inclusive at L3 while AMD are mostly exclusive.