I'm surely missing something, because if that were true, any (single-threaded) "protocol" that relies on the order of writes/loads against (non-mapped) ByteBuffers to be fast (i.e. sequential writes rock :P) risks not having that order respected, unless it uses patterns that force the compiler to block the reordering of such instructions (a sci-fi hypothesis).
With great regards,
Francesco
Doesn't hardware already reorder memory writes along 64 byte boundaries? They're called cache lines.
Dave
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
This is indeed what I was expecting... While other archs (PowerPC, tons of ARMs, and the legendary DEC Alpha) are allowed to be pretty creative when it comes to reordering... And that's the core of my question: how much can a developer rely on the compiler (or the underlying HW) respecting the memory accesses he has put into the code, without using any fences? Is the answer really "it depends on the compiler/architecture"? Or do common high-level patterns exist that are respected by "most" compilers/architectures?
Il lun 16 gen 2017, 22:14 Vitaly Davidovich <vit...@gmail.com> ha scritto:
Depends on which hardware. For instance, x86/64 is very specific about what memory operations can be reordered (for cacheable operations), and two stores aren't reordered. The only reordering is stores followed by loads, where the load can appear to reorder with the preceding store.
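A minimal litmus-style sketch of that store-then-load case (the class name and trial counts are mine; a real harness would use jcstress). On x86/64, each thread's store may appear to reorder with its own subsequent load, so the outcome (r1,r2)=(0,0) is legal; the sketch only checks that every observed outcome is one the model allows, since actually hitting the reordering is timing-dependent:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CyclicBarrier;

// Hypothetical litmus-style sketch: on x86/64 the store->load pair in each
// thread may appear reordered, so (r1,r2)=(0,0) is a legal outcome. This only
// verifies that all observed outcomes are in the allowed set.
public class StoreLoadLitmus {
    static int x, y, r1, r2;            // plain fields; per-trial resets are published via the barriers
    static final int TRIALS = 10_000;

    public static void main(String[] args) throws Exception {
        CyclicBarrier start = new CyclicBarrier(3), end = new CyclicBarrier(3);
        Thread t1 = new Thread(() -> run(start, end, true));
        Thread t2 = new Thread(() -> run(start, end, false));
        t1.start();
        t2.start();
        Set<String> outcomes = new TreeSet<>();
        for (int i = 0; i < TRIALS; i++) {
            x = y = r1 = r2 = 0;        // fresh state for this trial
            start.await();              // release both threads
            end.await();                // wait for both to finish
            outcomes.add("(" + r1 + "," + r2 + ")");
        }
        t1.join();
        t2.join();
        boolean allAllowed =
            Set.of("(0,0)", "(0,1)", "(1,0)", "(1,1)").containsAll(outcomes);
        System.out.println("allAllowed=" + allAllowed);
    }

    static void run(CyclicBarrier start, CyclicBarrier end, boolean first) {
        try {
            for (int i = 0; i < TRIALS; i++) {
                start.await();
                if (first) { x = 1; r1 = y; }   // store x, then load y
                else       { y = 1; r2 = x; }   // store y, then load x
                end.await();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```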
On Mon, Jan 16, 2017 at 4:02 PM Dave Cheney <da...@cheney.net> wrote:
Doesn't hardware already reorder memory writes along 64 byte boundaries? They're called cache lines.
Dave
On Tue, 17 Jan 2017, 05:35 Tavian Barnes <tavia...@gmail.com> wrote:
On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
> I'm missing something for sure, because if it was true, any (single-threaded) "protocol" that rely on the order of writes/loads against (not mapped) ByteBuffers to be fast (ie: sequential writes rocks :P) risks to not see the order respected if not using patterns that force the compiler to block the re-ordering of such instructions (Sci-Fi hypothesis).
> with great regards, Francesco

I don't think you're missing anything. The JVM would be stupid to reorder your sequential writes into random writes, but it's perfectly within its rights to do so for a single-threaded program according to the JMM, as long as it respects data dependencies (AFAIK). Of course, that would be a huge quality-of-implementation issue, but that's an entirely separate class from correctness issues.
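To illustrate that point with a trivial single-threaded sketch (the class name is mine): whatever order the JIT actually emits the stores in, the program must observe its own writes as if in program order, so a sequential-write pattern over a ByteBuffer is only ever a performance question, never a correctness one:

```java
import java.nio.ByteBuffer;

// Single-threaded sketch: the JIT may reorder these stores internally, but
// the program must still observe its own writes in program order, so the
// result is guaranteed. Only throughput, not correctness, can vary.
public class SequentialWrites {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16 * Integer.BYTES);
        for (int i = 0; i < 16; i++) {
            buf.putInt(i * Integer.BYTES, i);   // absolute puts at sequential offsets
        }
        int sum = 0;
        for (int i = 0; i < 16; i++) {
            sum += buf.getInt(i * Integer.BYTES);
        }
        System.out.println(sum);                // 0 + 1 + ... + 15
    }
}
```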
--
Sent from my phone
"Assume nothing" is a pretty scientific approach, I like it :)
But this (absence of) assumption leads me to think about another couple of things: what about all the encoders/decoders, or any program that relies on data access patterns to be, and remain, "fast"?
Does writing mechanically sympathetic code mean being smart enough to trick the compiler in new and more creative ways, or checking what a compiler doesn't handle so well and trying to manage it yourself?
The last one is a very strong statement, just for the sake of discussion... But I'm curious to know what the folks in this group think about it :)
Atomicity of values isn't something I'd assume happens automatically. Word tearing isn't observable from single-threaded code.
I'm all for experimenting and digging into what the generated code actually looks like. Just remember that it's a moving target, and that you should not assume that the code choices will look similar in the future.
E.g. I'm working on a presentation that demonstrates how our JITs make use of the capabilities of newer Intel cores. To study machine code, I generally "make it hot" (e.g. with jmh, using DONT_INLINE directives) and then look at it with ZVision (a feature of Zing), which is very useful because the instruction-level profiling ticks allow me to quickly identify and zoom into the actual loop with the code I want to study (and ignore the 90%+ of the rest of the machine code that is not the actual hot loop).
Below is a short sequence from my work-in-progress slide deck. It demonstrates how much of a moving target code-gen is for the exact same code, and even the exact same compiler (e.g. varying according to the core you are running on). In this case you are looking at what Zing's Falcon JIT generates for Westmere vs. Broadwell for the same simple loop. But the same sort of variance and adaptability will probably be there for other JITs and static compilers.
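For anyone wanting to reproduce this kind of study without Zing, a bare-bones sketch (class name and iteration counts are mine) that makes a method hot and keeps it out-of-line so its compiled code is easy to find in `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` output; with jmh you would instead use `@CompilerControl(CompilerControl.Mode.DONT_INLINE)` as Gil describes:

```java
// Stand-alone stand-in for a jmh DONT_INLINE setup: sum() stays a separate
// method, so after the warm-up loop triggers JIT compilation its machine code
// appears as one easily located compiled method in PrintAssembly output.
public class HotLoop {
    static int sum(int[] a) {
        int s = 0;
        for (int v : a) s += v;
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) data[i] = i;
        int result = 0;
        for (int iter = 0; iter < 20_000; iter++) {
            result = sum(data);   // well past the default compile threshold
        }
        System.out.println(result); // 0 + 1 + ... + 1023
    }
}
```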
On Tuesday, January 17, 2017 at 2:41:21 AM UTC-5, Francesco Nigro wrote:
> Thanks, "Assume nothing" is a pretty scientific approach, I like it :)
> But this (absence of) assumption lead me to think about another couple of things: what about all the encoders/decoders or any program that rely on data access patterns to pretend to be and remain "fast"?
> Writing mechanichal sympathetic code means being smart enough to trick the compiler in new and more creative ways or check what a compiler doesn't handle so well, trying to manage it by yourself?
> The last one is a very strong statement just for the sake of discussion.. But I'm curious to know what the guys of this group think about it :)
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically. Word
> tearing isn't observable from single threaded code.
On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too. Normal load/stores on 32 bit JVMs would tear 64
> bit values. Although, I think object references are guaranteed to be written
> atomically.
(Triggered by my pet peeve)
Please don't confuse "access atomicity" and "word tearing". Access atomicity
means reading/writing the value in full. Word tearing (at least per JLS 17.6)
means that fields are considered distinct, and the read/write of one field
cannot disturb neighbors.
Speaking of access atomicity, the upcoming value types would probably be
non-access-atomic by default too, because enforcing access atomicity for widths
larger than a machine word is painful. volatile longs/doubles had it easy on
32-bit VMs with 64-bit FPUs, compared to what variable-length values would have to
experience.
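A sketch of the access-atomicity half of that distinction (JLS 17.7): plain `long`/`double` accesses may tear on a 32-bit VM, while `volatile` ones may not. The class name and bit patterns below are mine; the writer alternates between two 64-bit values whose halves differ, so a torn read would observe a mix of the two:

```java
// JLS 17.7 sketch: reads/writes of volatile long are access-atomic, so the
// reader can only ever observe one of the two written patterns. Dropping
// 'volatile' on a 32-bit VM would make torn (mixed-half) values legal.
public class LongAtomicity {
    static volatile long v;

    public static void main(String[] args) throws Exception {
        final long A = 0x0000_0000_0000_0000L;
        final long B = 0xFFFF_FFFF_FFFF_FFFFL;
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                v = (i & 1) == 0 ? A : B;   // alternate the two patterns
            }
        });
        writer.start();
        boolean torn = false;
        while (writer.isAlive()) {
            long s = v;
            if (s != A && s != B) torn = true;  // would indicate a torn read
        }
        writer.join();
        System.out.println("torn=" + torn);
    }
}
```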
Thanks,
-Aleksey
Hi Gil,
Your slides are really inspiring, especially the JIT code; it's now comparable with code produced by static C/C++ compilers. Have you compared the performance of this code with code produced by ICC (Intel's compiler), for example?
BTW, it may be better for performance to schedule instructions in an out-of-order manner (interleaving independent instructions in order to maximize the distance between dependent instructions and so utilize as many execution ports as possible).
--
Sergey
In most cases the heaviest instructions are loads/stores. So, I believe, in this case it's better to try to hide load latency than to enable macro-fusion. BTW, I'm not sure about SKL/SKX, but for the previous generations macro-fusion depends on code alignment.
--
Sergey
On Wed, Jan 18, 2017 at 2:44 AM, Aleksey Shipilev <aleksey....@gmail.com> wrote:
(triggered again)
On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
>
> But if you schedule them this way
>
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx
...doesn't this give up on macro-fusion, and effectively set up a better
chance of a "bottleneck in the instruction fetch/decode phases"? :)
-Aleksey