It is possible that the scheduling is constrained by pointer-aliasing assumptions. Could you share the source for the loop in question?
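To make the aliasing concern concrete, here is a minimal sketch in C. It is a hypothetical loop, not the source being asked for; the function names and the use of `restrict` are assumptions chosen for illustration. In the first version the compiler cannot prove that the store to a[i] leaves b[] and c[] untouched, so it must either emit runtime overlap checks or keep later loads behind earlier stores. In the second version the `restrict` qualifiers promise that no overlap exists, which gives the scheduler more freedom to reorder loads and stores across unrolled iterations.

```c
#include <stddef.h>

/* Hypothetical example: pointers may overlap, so a store to a[i]
   could change a later b[j] or c[j] as far as the compiler knows. */
void add_may_alias(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Same loop, but restrict tells the compiler the three arrays do not
   overlap, so loads from later iterations may be hoisted above
   stores from earlier ones. */
void add_no_alias(int *restrict a, const int *restrict b,
                  const int *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Both functions compute the same result when the arrays really are disjoint; only the freedom given to the optimizer differs.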
RIP-relative addressing, as I recall, is a feature of position-independent code. Based on what's below, it might cause problems by making the instruction encodings large. CC'ing some Intel folks for further comments.
-Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Hi Ahmed,
From what can be seen in the code snippet you provided, the reuse of XMM0 and XMM1 across loop-unroll instances does not inhibit instruction-level parallelism.
Modern X86 processors use register renaming that can eliminate the dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as independent and pipeline their execution.
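As a rough illustration of the renaming argument, the following toy model in C renames a two-iteration unrolled sequence (2 loads, 1 add, 1 store per iteration). It is only a sketch: the `Insn` struct, the function names, and the two-register machine are assumptions made for this example and do not describe any real microarchitecture. Every register write receives a fresh physical register, and every read uses the most recent mapping.

```c
enum { NUM_ARCH_REGS = 2 };  /* toy model: only xmm0 and xmm1 */

/* One instruction: up to two register sources and one register
   destination. A value of -1 means "no register operand". */
typedef struct { int src0, src1, dst; } Insn;

/* Rename in program order. Reads use the current mapping; each write
   allocates a new physical register and updates the mapping. */
void rename_regs(const Insn *prog, int n, Insn *renamed) {
    int map[NUM_ARCH_REGS];
    int next_phys = 0;
    for (int r = 0; r < NUM_ARCH_REGS; ++r)
        map[r] = next_phys++;
    for (int i = 0; i < n; ++i) {
        renamed[i].src0 = prog[i].src0 < 0 ? -1 : map[prog[i].src0];
        renamed[i].src1 = prog[i].src1 < 0 ? -1 : map[prog[i].src1];
        if (prog[i].dst >= 0) {
            map[prog[i].dst] = next_phys;
            renamed[i].dst = next_phys++;
        } else {
            renamed[i].dst = -1;
        }
    }
}

/* Returns 1 if any instruction in [begin, end) reads a register that
   was written by an instruction in [0, begin), else 0. */
int reads_from_earlier(const Insn *insns, int begin, int end) {
    for (int i = begin; i < end; ++i)
        for (int j = 0; j < begin; ++j)
            if (insns[j].dst >= 0 &&
                (insns[i].src0 == insns[j].dst ||
                 insns[i].src1 == insns[j].dst))
                return 1;
    return 0;
}
```

Applied to two back-to-back copies of load xmm0; load xmm1; add xmm0, xmm0, xmm1; store xmm0, the check on architectural names reports a match (both iterations name xmm0), while the check on renamed registers reports none, which is the sense in which the second iteration is independent of the first.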
Thanks, Zvi
From: Hal Finkel [mailto:hfi...@anl.gov]
Sent: Saturday, June 24, 2017 05:17
To: hameeza ahmed <hahme...@gmail.com>; llvm...@lists.llvm.org
Cc: Demikhovsky, Elena <elena.de...@intel.com>; Rackover, Zvi <zvi.ra...@intel.com>; Breger, Igor <igor....@intel.com>; craig....@gmail.com
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit concerned about the addressing, and specifically, the size of the resulting encoding:
vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1<-zmm1+b[401344]
vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load c[401280] in zmm0
The KNL can only deliver 16 bytes per cycle from the icache to the decoder. Essentially all of the instructions in the loop, as we seem to generate it, have 10-byte encodings:
10: 62 f1 7e 48 6f 80 00 vmovdqu32 0x0(%rax),%zmm0
17: 00 00 00
16: R_X86_64_32S c+0x61f00
...
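To put numbers on the concern, here is a small back-of-the-envelope sketch in C. The byte array is the encoding from the disassembly above (4-byte EVEX prefix, opcode, ModRM, 4-byte displacement); the helper name `fetch_bound_ipc` is made up for this example, and the comparison against a 2-instruction-per-cycle decode width is an assumption based on the commonly cited KNL front-end width, not something stated earlier in this thread.

```c
#include <stddef.h>

/* Encoding of vmovdqu32 0x0(%rax),%zmm0 as shown by the disassembler:
   62 f1 7e 48 = EVEX prefix, 6f = opcode, 80 = ModRM (disp32 form),
   00 00 00 00 = 32-bit displacement filled in by the relocation. */
static const unsigned char vmovdqu32_bytes[] = {
    0x62, 0xf1, 0x7e, 0x48, 0x6f, 0x80, 0x00, 0x00, 0x00, 0x00
};

/* Upper bound on instructions per cycle imposed by fetch bandwidth
   alone, ignoring alignment and every other pipeline limit. */
double fetch_bound_ipc(double fetch_bytes_per_cycle, size_t insn_bytes) {
    return fetch_bytes_per_cycle / (double)insn_bytes;
}
```

With 16 bytes fetched per cycle and 10-byte instructions, `fetch_bound_ipc(16.0, sizeof vmovdqu32_bytes)` gives 1.6 instructions per cycle, so under the assumed 2-wide decode the front end, rather than the vector units, could become the limit for this loop.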