Operation Reordering


Francesco Nigro

Jan 16, 2017, 12:38:01 PM
to mechanica...@googlegroups.com
Hi guys!

Maybe what I'm asking will sound like a very naive question, but recently I've found it very hard to work out whether there is a "rule of thumb" for how the JVM (or anything else, SW- or HW-wise, that could mess with the order of the statements of your program) manages the reordering of operations in the single-threaded case.

As a general rule I've always assumed that the order would appear to be the program order, but this implies that writing to a (regular direct or array-backed) ByteBuffer in a particular order is different from doing the same against a MappedByteBuffer.
That rule has led me, for example, to use explicit fences (for single-threaded programs too) as a means to draw "lines in the sand" and be sure that the compiler will access data in the order I expect.
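
To make it concrete, this is roughly the pattern I mean (just a sketch; the UnsafeAccess helper and the frame layout are invented for illustration):

import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Sketch of the "lines in the sand" idea: explicit fences between plain writes,
// hoping the compiler will not move the buffer accesses across them.
final class FramedWriter {
    // Hypothetical accessor; real code obtains Unsafe via reflection.
    private static final Unsafe UNSAFE = UnsafeAccess.UNSAFE;

    static void writeFrame(ByteBuffer buffer, int offset, long header, int payload) {
        buffer.putLong(offset, header);      // first write
        UNSAFE.storeFence();                 // "line in the sand" after the header
        buffer.putInt(offset + 8, payload);  // second write
        UNSAFE.storeFence();                 // and another one after the payload
    }
}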

This perception was reinforced after reading (if I've understood it correctly) Cliff Click's article about the "sea of nodes", in which every operation on memory-mapped I/O produces a new I/O state, and that prevents any reordering between the two kinds of operations (LOAD and STORE), while all other operations are ordered only by their data dependencies (in theory, there is no limit on reordering!).
I'm surely missing something, because if that were true, any (single-threaded) "protocol" that relies on the order of writes/loads against (non-mapped) ByteBuffers to be fast (i.e. sequential writes rock :P) risks not having that order respected unless it uses patterns that force the compiler to block the reordering of those instructions (a sci-fi hypothesis).
A. Shipilev (on Twitter) pointed me to gcm.cpp to better understand which reorderings are allowed to happen... but I'm not confident I've understood it properly.
I'm not even sure that building an empirical experiment and reading the produced ASM would prove anything useful beyond what happens in one very specific case (and under one very specific, strong memory model: the one of my laptop's x86 CPU).
What do you think?

with great regards,
Francesco

Tavian Barnes

Jan 16, 2017, 1:35:47 PM
to mechanical-sympathy
On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
I'm surely missing something, because if that were true, any (single-threaded) "protocol" that relies on the order of writes/loads against (non-mapped) ByteBuffers to be fast (i.e. sequential writes rock :P) risks not having that order respected unless it uses patterns that force the compiler to block the reordering of those instructions (a sci-fi hypothesis).

I don't think you're missing anything.  The JVM would be stupid to reorder your sequential writes into random writes, but it's perfectly within its rights to do so for a single-threaded program according to the JMM, as long as it respects data dependencies (AFAIK).  Of course, that would be a huge quality-of-implementation issue, but that's an entirely separate class from correctness issues.
 

Dave Cheney

Jan 16, 2017, 4:02:55 PM
to mechanical-sympathy

Doesn't hardware already reorder memory writes along 64-byte boundaries? They're called cache lines.

Dave



Vitaly Davidovich

Jan 16, 2017, 4:14:57 PM
to mechanica...@googlegroups.com
Depends on which hardware.  For instance, x86/64 is very specific about what memory operations can be reordered (for cacheable operations), and two stores aren't reordered.  The only reordering is stores followed by loads, where the load can appear to reorder with the preceding store.
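
The classic litmus test for that store->load case looks like this (sketch; with plain fields the compiler is free to reorder as well, so both r1 == 0 and r2 == 0 can be observed):

// Dekker-style store->load litmus test. Each thread does a store followed by a
// load of the other variable; the store may be reordered with the later load
// (the one reordering x86 allows), so both threads can end up reading 0.
final class StoreLoadLitmus {
    int x, y;
    int r1, r2;

    void thread1() { x = 1; r1 = y; }
    void thread2() { y = 1; r2 = x; }
}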
--
Sent from my phone

Francesco Nigro

Jan 16, 2017, 4:49:39 PM
to mechanica...@googlegroups.com

This is indeed what I was expecting... while other archs (PowerPC, tons of ARMs, and the legendary DEC Alpha) are allowed to be pretty creative when it comes to reordering. And that's the core of my question: how much can a developer rely on the compiler (or the underlying HW) respecting the memory accesses he has put into the code, without using any fences? Is the answer really "it depends on the compiler/architecture"? Or are there common high-level patterns respected by "most" compilers/architectures?



Gil Tene

Jan 16, 2017, 5:38:56 PM
to mechanica...@googlegroups.com
The compiler's reordering generally DOES NOT depend on the hardware. Optimizations that result in reordering generally occur well before instruction selection, and will happen in the same way for different hardware architectures. E.g. on X86, PowerPC, and ARM, HotSpot, gcc, and clang will all frequently reorder two stores, two loads, and any pair of loads and stores as long as there is nothing to explicitly prevent doing so. So forget about any sort of "hardware model" when thinking about the order you can expect.

The simple rule is "assume nothing". If a reordering is not specifically prohibited, assume it will happen. If you assume otherwise, you are likely to be unpleasantly surprised.

As for "stupid" reorderings, stupidity is in the eye of the beholder. "Surprising" sequence-breaking reordering that you may not see an immediate or obvious reason for may be beneficial in many ways.

E.g. take the following simple loop:

int[] a, b, c;
...
for (int i = 0; i < a.length; i++) {
    a[i] = b[i] + c[i];
}

You can certainly expect (due to causality) loads from b[i] to occur before stores to a[i]. But is it reasonable to expect loads of b[i+1] to happen AFTER stores to a[i]? After all, that's the order of operations in the program, right? Would the JVM be "stupid" to reorder things such that some stores to a[i+1] occur before some loads from b[i]?

A simple optimization which most compilers will hopefully apply to the above loop is to use vector operations (SSE, AVX, etc.) on processors capable of them, coupled with loop unrolling. E.g. in practice, the bulk of the loop will be executing on 8 slots at a time on modern AVX2 x86 CPUs, and multiple such 8-slot operations could be in flight at the same time (due both to the compiler unrolling the loop and the processor aggressively doing OOOE even without the compiler unrolling anything). The loads from one such operation are absolutely allowed to [and even likely to] occur before stores that appear earlier in the instruction stream (yes, even on x86, but also because the compiler may jumble them any way it wants in the unrolling). There is nothing "stupid" about that, and we should all hope that both the compiler and the hardware will feel free to jumble that order to get the best speed for this loop...
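
To make that concrete, here is a hand-written sketch (not what any particular JIT emits) of a 4-way unrolled form of that loop: assuming a, b and c don't alias, it computes exactly the same result, yet the loads for later iterations are issued before the stores of earlier ones:

// Hand-unrolled sketch: all four pairs of loads are issued before any store,
// which is exactly the kind of "surprising" reordering a compiler or an OOO
// core may perform. Assumes a, b and c do not alias each other.
static void addUnrolled(int[] a, int[] b, int[] c) {
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
        int b0 = b[i],     c0 = c[i];
        int b1 = b[i + 1], c1 = c[i + 1];
        int b2 = b[i + 2], c2 = c[i + 2];
        int b3 = b[i + 3], c3 = c[i + 3];   // loads of b[i+3]/c[i+3] happen...
        a[i]     = b0 + c0;                 // ...before this store to a[i]
        a[i + 1] = b1 + c1;
        a[i + 2] = b2 + c2;
        a[i + 3] = b3 + c3;
    }
    for (; i < a.length; i++) {             // scalar tail
        a[i] = b[i] + c[i];
    }
}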

Even without vectorizing or loop unrolling by the compiler, the CPU is free to reorder things in many ways. E.g. if one operation misses in the cache and the next (in logical "i" sequence order) hits in the cache, there is no reason for order to be maintained between the earlier store and the subsequent load. Now imagine that in a processor that can juggle 72 in flight loads and 48 in flight stores at the same time (e.g. a Haswell core), and you will quickly realize that any expectation of order or sequence of memory access not explicitly required should be left at the door.



Francesco Nigro

Jan 17, 2017, 2:41:21 AM
to mechanical-sympathy
Thanks,

"Assume nothing" is a pretty scientific approach,I like it :)
But this (absence of) assumption lead me to think about another couple of things: what about all the encoders/decoders or any program that rely on data access patterns to pretend to be and remain "fast"?
Writing mechanichal sympathetic code means being smart enough to trick the compiler in new and more creative ways or check what a compiler doesn't handle so well,trying to manage it by yourself?
The last one is a very strong statement just for the sake of discussion..But I'm curious to know what the guys of this group think about it :)

Gil Tene

Jan 17, 2017, 3:48:04 AM
to mechanica...@googlegroups.com
I'm all for experimenting and digging into what the generated code actually looks like. Just remember that it's a moving target, and that you should not assume that the code choices will look similar in the future.

E.g. I'm working on a presentation that demonstrates how our JITs make use of the capabilities of newer Intel cores. To study machine code, I generally "make it hot" (e.g. with JMH, using DONT_INLINE directives) and then look at it with ZVision (a feature of Zing), which is very useful because the instruction-level profiling ticks allow me to quickly identify and zoom in on the actual loop with the code I want to study (and ignore the 90%+ of the machine code that is not the actual hot loop).
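
For reference, a minimal sketch of the kind of JMH harness I mean (the benchmark body is just a stand-in):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class AddLoopBench {
    int[] a = new int[1024], b = new int[1024], c = new int[1024];

    @Benchmark
    public int[] add() {
        addLoop(a, b, c);   // keep the result reachable so the loop isn't dead-code eliminated
        return a;
    }

    // DONT_INLINE keeps the loop out of the benchmark stub, so its machine code
    // is easy to find once it is hot.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static void addLoop(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + c[i];
        }
    }
}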

Below is a short sequence from my work-in-progress slide deck. It demonstrates how much of a moving target code-gen is for the exact same code, and even the exact same compiler (e.g. varying according to the core you are running on). In this case you are looking at what Zing's Falcon JIT generates for Westmere vs. Broadwell for the same simple loop. But the same sort of variance and adaptability will probably be there for other JITs and static compilers.








[Two slides here compare the Falcon-generated machine code for the same loop on Westmere vs. Broadwell.]

Nitsan Wakart

Jan 17, 2017, 4:59:27 AM
to mechanica...@googlegroups.com
"what about all the encoders/decoders or any program that rely on data access patterns to pretend to be and remain "fast"?"
There's no problem in reordering while maintaining observable effects, right?
You should assume a compiler interprets "observable effects" to mean "order imposed by memory model". Hence Gil's approach reflects this as "assume nothing" about plain memory access code, except that it will appear to have executed in program order when it is "observable". You can "cheat" and observe the effects of "plain access" code as it executes (e.g. read the mutated memory from another thread, or observe it using a native debugger), but the compiler is not bound to order what you'll see in any particular way (hence reordering and eliminating loads/stores etc is fine).

"how much a developer could rely on the fact that a compiler (or the underline HW) will respect the memory access that he has put into the code without using any fences?"
Not at all... OK, just a little bit:
- You can assume program order will magically work out in the end (when observable)
- You can assume atomicity of values (no word tearing)
That's pretty much it.
For decoders/encoders:
- If you expect the decoded/encoded contents to be read from another thread, you need fences/barriers (see the sketch below).
- If you expect the decoded/encoded contents to be read from the same thread, the appearance of program order should sort you out.
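
For the cross-thread case, a minimal sketch of what I mean by a barrier (using a volatile flag for publication; names are made up):

// Plain writes fill the buffer, the volatile store publishes them; the volatile
// read on the other side orders the subsequent plain reads. Without that pair,
// the reader may observe the plain writes in any order, or not at all.
final class Encoded {
    final byte[] buffer = new byte[64];
    volatile boolean ready;                 // publication flag

    void encode(byte a, byte b) {           // writer thread
        buffer[0] = a;                      // plain writes, free to be reordered...
        buffer[1] = b;
        ready = true;                       // ...until this volatile store publishes them
    }

    int decode() {                          // reader thread
        if (!ready) {
            return -1;                      // not published yet
        }
        return ((buffer[0] & 0xFF) << 8) | (buffer[1] & 0xFF);
    }
}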

It doesn't hurt to look at generated code, see if the compiler is being stupid, and try to help it out, but the validity of such an effort can be very limited (because HW/profile/compiler differences can end up nullifying your efforts).

Vitaly Davidovich

Jan 17, 2017, 6:55:53 AM
to mechanica...@googlegroups.com
Atomicity of values isn't something I'd assume happens automatically.  Word tearing isn't observable from single threaded code.

I think the only thing you can safely and portably assume is the high level "single threaded observable behavior will occur" statement.  It's also interesting to note that "passage of time" isn't one of those observable effects, despite sometimes being THE effect you're after (e.g. crypto).
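
E.g. a "constant-time" comparison like this sketch relies entirely on a property the compiler is not obliged to preserve, since elapsed time is not an observable effect:

// The early-exit-free loop is meant to make timing independent of where the
// first mismatch is, but nothing in the spec stops a compiler from turning it
// back into something that exits early.
static boolean constantTimeEquals(byte[] a, byte[] b) {
    if (a.length != b.length) {
        return false;
    }
    int diff = 0;
    for (int i = 0; i < a.length; i++) {
        diff |= a[i] ^ b[i];
    }
    return diff == 0;
}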


Michael Barker

Jan 17, 2017, 3:17:42 PM
to mechanica...@googlegroups.com
Atomicity of values isn't something I'd assume happens automatically.  Word tearing isn't observable from single threaded code.

That was my understanding too.  Normal loads/stores on 32-bit JVMs would tear 64-bit values.  Although, I think object references are guaranteed to be written atomically.

Aleksey Shipilev

Jan 17, 2017, 3:39:34 PM
to mechanica...@googlegroups.com
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically. Word
> tearing isn't observable from single threaded code.

On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too. Normal load/stores on 32 bit JVMs would tear 64
> bit values. Although, I think object references are guaranteed to be written
> atomically.

(Triggered by my pet peeve)

Please don't confuse "access atomicity" and "word tearing". Access atomicity
means reading/writing the value in full. Word tearing (at least per JLS 17.6)
means that fields are considered distinct, and the read/write of one field
cannot disturb neighbors.
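
A tiny illustration of the two notions (sketch, field names made up):

// JLS 17.6 (no word tearing): writing flags[0] must never disturb flags[1],
// even if the hardware's narrowest store is wider than a byte.
// JLS 17.7 (access atomicity): a read of wide must see a value some thread
// actually wrote, never half of one write and half of another; plain long
// gets an exemption, volatile long does not.
final class Distinct {
    final byte[] flags = new byte[2];   // neighbours that must stay independent
    long wide;                          // may legally tear when plain (non-volatile)
}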

Speaking of access atomicity, the upcoming value types would probably be non-access-atomic by default too, because enforcing access atomicity for widths larger than a machine word is painful. Volatile longs/doubles had it easy on 32-bit VMs with 64-bit FPUs, compared to what variable-length values would have to experience.

Thanks,
-Aleksey



Sergey Melnikov

Jan 17, 2017, 3:55:24 PM
to mechanical-sympathy, g...@azul.com
Hi Gil,

Your slides are really inspiring, especially for JIT code. Now it's comparable with code produced by static C/C++ compilers. Have you compared the performance of this code with code produced by ICC (Intel's compiler), for example?

BTW, it may be better for performance to schedule instructions in an out-of-order manner (interleave independent instructions in order to maximize the distance between dependent instructions and so utilize as many execution ports as possible). And a small hint: there is no need for the rbx register calculations here, so they may be hoisted out of the loop (which would reduce register pressure); a more sophisticated addressing mode may help reduce the rcx calculation.

P.S. It's great to see the nop instruction at address 0x3001455e for proper code alignment ;-)

--Sergey



Vitaly Davidovich

Jan 17, 2017, 4:02:18 PM
to mechanical-sympathy
On Tue, Jan 17, 2017 at 3:39 PM, Aleksey Shipilev <aleksey....@gmail.com> wrote:
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically.  Word
> tearing isn't observable from single threaded code.

On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too.  Normal load/stores on 32 bit JVMs would tear 64
> bit values.  Although, I think object references are guaranteed to be written
> atomically.

(Triggered by my pet peeve)

Please don't confuse "access atomicity" and "word tearing". Access atomicity
means reading/writing the value in full. Word tearing (at least per JLS 17.6)
means that fields are considered distinct, and the read/write of one field
cannot disturb neighbors.
Yeah, "word tearing" is an overloaded term.  You can also consider splitting a word across cachelines as potentially causing a tear as a store/load involves two cachelines.  But my point was really that word tearing and access atomicity (or lack thereof) aren't observable in single threaded code. 

Speaking of access atomicity, the upcoming value types would probably be
non-access-atomic by default too, because enforcing access atomicity for widths
larger than a machine one is painful. volatile longs/doubles had it easy on
32-bit VMs with 64-bit FPUs, compared to what variable-len values would have to
experience.
I don't see how they could be access atomic either.


Vitaly Davidovich

Jan 17, 2017, 4:11:57 PM
to mechanical-sympathy, Gil Tene
On Tue, Jan 17, 2017 at 3:55 PM, Sergey Melnikov <melnikov...@gmail.com> wrote:
​Hi Gil,

​Your ​slides are really inspiring, especially for JIT code. Now, it's comparable with code produced by static C/C++ compilers. Have you compared a performance of this code with a code produced by ICC (Intel's compiler) for example?
I believe the Falcon JIT that Gil refers to is Azul's LLVM-based JIT, so you get the same middle/backend as Clang.

As for ICC, I'm not even sure Clang+LLVM (or even GCC) autovectorize as well as ICC (at least that used to be the case).

BTW, it may be better for performance to schedule instruction in out-of-order manner (interchange independent instructions in order to maximize a distance between dependent instructions and so utilize as many execution ports as possible).
Pretty sure OOO cores will do a good job themselves for scheduling provided you don't bottleneck in instruction fetch/decode phases or create other pipeline hazards.  If you artificially increase distance between dependent instructions, you may cause instructions to hang out in the reorder buffers longer before they can retire.
 

Sergey Melnikov

Jan 17, 2017, 6:33:04 PM
to mechanical-sympathy, Gil Tene
>> Pretty sure OOO cores will do a good job themselves for scheduling provided you don't bottleneck in instruction fetch/decode phases or create other pipeline hazards.  If you artificially increase distance between dependent instructions, you may cause instructions to hang out in the reorder buffers longer before they can retire.

As usual, corner cases may have worse performance, and it's a tricky compromise between locality and OOO scheduling. But in the general case, good instruction scheduling (out-of-order aware) may bring up to an additional dozen percent of performance. It's important even for modern big cores (HSW), not just Atom (Silvermont).

For example, if you have something like

mov (%rax), %rbx
cmp %rbx, %rdx
jxx Lxxx

It's difficult to execute these instructions in an OOO manner.
But if you schedule them this way

mov (%rax), %rbx
... few instructions
cmp %rbx, %rdx
... few instructions
jxx Lxxx

It would be possible to execute them out-of-order and calculate something additional.

Anyway, ICC does this kind of scheduling.


--Sergey

Aleksey Shipilev

Jan 17, 2017, 6:44:55 PM
to mechanica...@googlegroups.com, Gil Tene
(triggered again)

On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
>
> But if you schedule them this way
>
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx

...doesn't this give up on macro-fusion, and effectively sets up for a better
chance of a "bottleneck in instruction fetch/decode phases"? :)

-Aleksey


Sergey Melnikov

Jan 17, 2017, 6:58:31 PM
to mechanical-sympathy, Gil Tene
In most cases the heaviest instructions are loads/stores. So, I believe, in this case it's better to try to hide load latency than to enable macro-fusion. BTW, I'm not sure about SKL/SKX, but for the previous generations macro-fusion depends on code alignment.

--Sergey


Vitaly Davidovich

Jan 17, 2017, 7:00:14 PM
to mechanica...@googlegroups.com, Gil Tene
Hmm, I've never seen such scheduling (doesn't mean it doesn't exist of course) for OoO cores.  Besides what Aleksey said about macro fusion, what happens to the flags register in between the cmp and the jmp?

It's also hard to look at these few instructions in isolation.  For example, the mov+cmp+jmp sequence itself may start executing "ahead of time".

The only scheduling-type thing I've seen compilers do is breaking dependency chains to enable superscalar execution (e.g. loop unrolling and doing reassociative operations in parallel).

I also recall Mike Pall (of LuaJIT fame) mentioning that he doesn't think manual OoO scheduling makes sense for modern OoO chips.

I'm no expert on this, but breaking cmp+jmp sequences like that seems strange.  Particularly the cmp+jmp due to fusion and flags being set.


Vitaly Davidovich

Jan 17, 2017, 7:08:40 PM
to mechanica...@googlegroups.com, Gil Tene
The cache miss latency can be hidden either by this load being done ahead of time or if there are other instructions that can execute while this load is outstanding.  So breaking dependency chains is good, but extending the distance like this seems weird and may hurt common cases.  If ICC does this type of thing, it'd be interesting to know how it arrives at this decision.

These early loads sound analogous to software prefetch hints, but those are rife with problems and have very limited application where they win over hardware.


Dave Cheney

Jan 17, 2017, 7:15:06 PM
to mechanical-sympathy
> Yeah, "word tearing" is an overloaded term. You can also consider splitting a word across cachelines as potentially causing a tear as a store/load involves two cachelines

If a word was split across cache lines, it is by definition not aligned, so the guarantees of an atomic write don't apply, cache or no cache.

Vitaly Davidovich

Jan 17, 2017, 7:19:05 PM
to mechanica...@googlegroups.com
I understand.  My point was that you could induce tearing of a single field via that scenario, and not just via a CPU whose loads/stores are too wide (lacking fine-grained granularity) and so can impact neighbors.
--
Sent from my phone

Vitaly Davidovich

Jan 17, 2017, 7:23:56 PM
to mechanica...@googlegroups.com, Gil Tene
And I should also mention that doing very early load scheduling will increase register pressure, as that value will need to be kept live across more instructions.  Stack spills and reloads suck in a hot/tight code sequence.

Nitsan Wakart

Jan 18, 2017, 3:06:48 PM
to mechanica...@googlegroups.com

"- You can assume atomicity of values (no word tearing)"

This seems to have tickled people, I apologise for my imprecise wording.

Better wording would be:
"You can assume atomicity of read/writes, and no word tearing, to the extent these are promised to you by the spec"
- long/double plain writes on 32-bit are not atomic. Other on-heap (non-Unsafe) writes to fields/arrays are atomic.
- anything larger than a byte that you write/read using Unsafe has no atomicity guarantees. If you do any off-heap stuff, here's looking at you.
- if you write to a single field/array element you are guaranteed no neighbouring fields/elements will be overwritten as a result.
Point being, you can expect some things.
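
A tiny illustration of the first point (per JLS 17.7):

// On a 32-bit JVM a plain long write may be performed as two 32-bit halves, so
// a concurrent reader can observe a mix of two different writes. Declaring the
// field volatile obliges the JVM to read/write it atomically, at whatever cost
// that implies on the platform.
final class Counters {
    long plainValue;            // tearing allowed on 32-bit JVMs
    volatile long safeValue;    // always read/written atomically
}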