Aeron zeroing buffer?


Peter Veentjer

May 28, 2017, 11:17:33 PM
to mechanical-sympathy
In Martin Thompson's talks about Aeron he mentions writer threads incrementing an AtomicLong counter to claim a region in a buffer. The initial 4 bytes for the length field aren't written at claim time; only on completion of the frame is the length set. This signals to the reader of the buffer that this particular write is complete and provides a happens-before relation.

My question is about this length field: who is responsible for zeroing it out? Once the buffer has been written and reused, the old content could be total gibberish, and if the field isn't zeroed, the reading thread could falsely assume a frame is complete and boom...
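
For concreteness, here is a minimal sketch of the pattern as I understand it, using Agrona's UnsafeBuffer (the offsets and names are assumptions for illustration, not Aeron's actual layout):

import org.agrona.concurrent.UnsafeBuffer;

final class FramePublishSketch {
    static final int LENGTH_OFFSET = 0; // first 4 bytes of the claimed region
    static final int DATA_OFFSET = 4;   // payload follows the length field

    // Writer: fill the payload with plain stores, then publish the length last
    // with an ordered store so a reader can never observe a half-written frame.
    static void commit(final UnsafeBuffer buffer, final int frameOffset, final byte[] payload) {
        buffer.putBytes(frameOffset + DATA_OFFSET, payload);
        buffer.putIntOrdered(frameOffset + LENGTH_OFFSET, payload.length);
    }

    // Reader: a volatile read of the length; 0 means "not committed yet", and a
    // non-zero value establishes the happens-before edge with the payload writes.
    static int frameLength(final UnsafeBuffer buffer, final int frameOffset) {
        return buffer.getIntVolatile(frameOffset + LENGTH_OFFSET);
    }
}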

Todd Montgomery

May 28, 2017, 11:39:27 PM
to mechanical-sympathy
The Conductor does this as part of its duty cycle.


Flow control and windowing prevent overrunning the cleaning position.
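
The shape of it is roughly this (a hypothetical sketch of the idea, not Aeron's actual conductor code; positions are absolute byte counts, assumed 8-byte aligned, with a power-of-two capacity):

import java.nio.ByteBuffer;

final class BufferCleanerSketch {
    final ByteBuffer buffer;
    final int capacity;
    long cleanPosition;          // all bytes below this have been zeroed
    volatile long writerLimit;   // writers must not claim beyond this

    BufferCleanerSketch(final ByteBuffer buffer) {
        this.buffer = buffer;
        this.capacity = buffer.capacity();
        this.writerLimit = capacity; // a fresh buffer starts out zeroed
    }

    // Called from the conductor's duty cycle: zero a bounded chunk of the
    // region the consumer has released, then advance the writer window.
    void doWork(final long consumerPosition) {
        final long target = Math.min(consumerPosition, cleanPosition + 4096);
        for (long p = cleanPosition; p < target; p += Long.BYTES) {
            buffer.putLong((int)(p & (capacity - 1)), 0L);
        }
        cleanPosition = target;
        writerLimit = cleanPosition + capacity; // flow control: one lap behind
    }
}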

-- Todd


Peter Veentjer

May 29, 2017, 12:36:55 AM
to mechanical-sympathy
Thanks Todd.

Ok, that makes sense.

A related question: what happens if two threads do plain writes to the same cache line but at independent locations?

If this happens concurrently, can the system run into a 'lost update'? I'm sure it can't, and I guess the cache coherence protocol takes care of it, but I would like to get confirmation anyway.

Todd Montgomery

May 29, 2017, 1:48:57 AM
to mechanical-sympathy
You have hit on false sharing. No update will be lost... but it will be slow. Both threads will race. One will win. The other will wait. This contention is the source of a lot of missed opportunity.

As is the case a lot of the time, my good friend Martin is much better at explaining it.



Martin Thompson

May 29, 2017, 1:50:22 AM
to mechanical-sympathy
If the writes are to independent, non-overlapping addresses, then no update will be lost, even if they are in the same cache line. This will result in false sharing, which is a performance issue but not a correctness issue.
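
A quick way to see the cost (a hypothetical micro-benchmark sketch; for serious measurement use JMH): two threads each write only their own slot of an AtomicLongArray, first in adjacent elements, which likely share a 64-byte line, then 16 elements (128 bytes) apart. Nothing is lost in either case, but the adjacent case is typically several times slower:

import java.util.concurrent.atomic.AtomicLongArray;

public final class FalseSharingSketch {
    static final long ITERATIONS = 100_000_000L;

    // Each thread writes only its own index, so no update can be lost by
    // construction; lazySet keeps the stores from being optimized away.
    static long runMillis(final int indexA, final int indexB) throws InterruptedException {
        final AtomicLongArray slots = new AtomicLongArray(32);
        final Thread a = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) slots.lazySet(indexA, i); });
        final Thread b = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) slots.lazySet(indexB, i); });
        final long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(final String[] args) throws InterruptedException {
        System.out.println("adjacent slots (same cache line): " + runMillis(0, 1) + " ms");
        System.out.println("padded slots (128 bytes apart):   " + runMillis(0, 16) + " ms");
    }
}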

Nitsan Wakart

May 29, 2017, 10:30:52 AM
to mechanica...@googlegroups.com
In particular, for the Aeron LogBuffer, false sharing on length writes can only happen (on systems with a 64-byte cache line) for zero-length messages, as the data header size is 32 bytes and messages are aligned to 32 bytes.

This is ignoring prefetch-induced false sharing, which increases the boundary of this issue to 128 bytes, but which is also less severe.
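
The arithmetic, as a sketch (align() here mirrors Agrona's BitUtil.align for power-of-two alignments; the constants are as described above):

public final class FrameAlignmentSketch {
    static final int HEADER_LENGTH = 32;   // Aeron data frame header, in bytes
    static final int FRAME_ALIGNMENT = 32; // frames are aligned to 32 bytes

    // Round value up to the next multiple of a power-of-two alignment.
    static int align(final int value, final int alignment) {
        return (value + (alignment - 1)) & ~(alignment - 1);
    }

    public static void main(final String[] args) {
        // Zero-length message: the frame occupies 32 bytes, so two consecutive
        // length fields can land in the same 64-byte cache line.
        System.out.println(align(HEADER_LENGTH + 0, FRAME_ALIGNMENT)); // 32
        // Any payload at all: the frame occupies at least 64 bytes, so
        // consecutive length fields fall in different 64-byte cache lines.
        System.out.println(align(HEADER_LENGTH + 1, FRAME_ALIGNMENT)); // 64
    }
}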




Avi Kivity

May 29, 2017, 11:37:06 AM
to mechanica...@googlegroups.com

Switching topics slightly, prefetch extending the effective cache line size was causing us some consternation, since we were never able to find where it was documented. Do you have a reference to it? When did it start happening?


It seems like it invalidates all software that was carefully written to honor 64-byte cache lines.


IIRC the Pentium 4 had 128-byte "sectors", but it was never fully explained what these were, and the term died with the P4.


Martin Thompson

May 29, 2017, 11:54:20 AM
to mechanical-sympathy

I've seen adjacent-cache-line prefetching on Intel processors since the Netburst days (well over a decade). Until Sandy Bridge it was generally recommended to disable it because memory bandwidth often became an issue. These days it works on the L2 cache, sitting alongside a prefetcher that looks for patterns of cache line accesses; L1 has different prefetchers. It does have quite a noticeable effect on false sharing, but not as much as sharing within the same 64-byte cache line.

Benedict Elliott Smith

May 29, 2017, 12:06:02 PM
to mechanica...@googlegroups.com
It's approximately where you'd expect: in the Intel 64 and IA-32 Architectures Optimization Reference Manual, under "Data Prefetching" on page 2-29, where it is referred to as the "Spatial Prefetcher".

It is pretty easy to miss, given it's only afforded a single sentence.

It's possible to disable it on a per-core basis.




Peter Veentjer

May 30, 2017, 6:25:50 AM
to mechanical-sympathy
Thanks everyone.

Another related question. When an atomic integer is used so that writers can claim their segment in the buffer, what prevents this atomic integer from wrapping, with a writer falsely assuming it has allocated a section in the buffer?

So imagine the buffer is full and the atomic integer > buffer.length. Any thread that wants to write keeps increasing this atomic integer far beyond the maximum capacity of the buffer. Which in itself is fine, because one can detect what kind of write failure one had:
- the first over-commit
- a subsequent over-commit

But what if the value wraps? In theory one could end up thinking one has claimed a segment of memory still needed for reading purposes.

One simple way to reduce the problem is to use an atomic long instead of an atomic integer.

Martin Thompson

May 30, 2017, 7:06:32 AM
to mechanical-sympathy
The claim is on a long with a check before increment.
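
Something of this shape (a minimal sketch of the idea, not Aeron's actual code):

import java.util.concurrent.atomic.AtomicLong;

final class ClaimSketch {
    final AtomicLong tail = new AtomicLong();
    final long capacity;

    ClaimSketch(final long capacity) { this.capacity = capacity; }

    // The tail is a 64-bit counter and a writer only advances it when the
    // claim still fits, so the counter can never run away and wrap.
    // Returns the claimed offset, or -1 if the buffer is full.
    long tryClaim(final int alignedLength) {
        long current;
        do {
            current = tail.get();
            if (current + alignedLength > capacity) {
                return -1; // reject before incrementing: no runaway, no wrap
            }
        } while (!tail.compareAndSet(current, current + alignedLength));
        return current;
    }
}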

Francesco Nigro

May 30, 2017, 3:22:09 PM
to mechanical-sympathy
The Intel manual is indeed pretty brief in explaining it, but a "simple" check with the proper tools could be a good experiment to validate it: https://joemario.github.io/blog/2016/09/01/c2c-blog/

I asked an Intel engineer about it some time ago (on 14th Feb) and he answered me this:

Hi Francesco,

About your questions on prefetchers:
  • Prefetchers normally kick in only after multiple cache lines in a specific pattern have been accessed, so I wouldn't worry too much about a single cache line.
  • Prefetchers tend to only read lines, so by themselves they cannot cause additional classic false sharing (but they may cause additional aborts on TSX).
  • The same is true for speculative execution. You have more to fight than just prefetching; speculative execution tends to pull in lots of data early. You can assume the CPU runs 150+ instructions ahead speculatively, if not more.
  • There isn't an automatic "get the next line" so much as there are pattern recognizers: if there's a sequential pattern, the next lines will be prefetched. It's not unconditional.
You can always test by enabling/disabling the prefetchers:
  wrmsr -a 0x1a4 0xf    // to disable
  wrmsr -a 0x1a4 0x0   // to enable
See https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors for more info.
The wrmsr tool is available at: https://01.org/msr-tools/overview


Francesco Nigro

May 30, 2017, 3:26:14 PM
to mechanica...@googlegroups.com
The coolest thing IMHO about the Aeron LogBuffer is how it trades performance on concurrent writes for relaxed precision on back-pressure: it is a really smart and effective idea!