[llvm-dev] Non-Temporal hints from Loop Vectorizer

hameeza ahmed via llvm-dev

Jan 20, 2018, 12:44:28 PM
to llvm-dev, Craig Topper
Hello,

My work deals with non-temporal loads and stores. I found the non-temporal metadata in the LLVM documentation, but it does not show up in the IR I generate.

How can I get the non-temporal metadata emitted?

Simon Pilgrim via llvm-dev

Jan 20, 2018, 1:02:35 PM
to hameeza ahmed, llvm-dev, Craig Topper
llvm\test\CodeGen\X86\nontemporal-loads.ll shows how to create nt vector
loads in IR - is that what you're after?

Simon.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

hameeza ahmed via llvm-dev

Jan 20, 2018, 1:16:48 PM
to Simon Pilgrim, Craig Topper, llvm-dev
Actually, I am working on a vector accelerator that will execute the instructions that are non-temporal.

For instance, if I have this loop:

for(i=0;i<2048;i++)
a[i]=b[i]+c[i];

the loop vectorizer currently emits the following IR:


  %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
  %1 = bitcast i32* %0 to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
  %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
  %9 = bitcast i32* %8 to <16 x i32>*
  %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
  %16 = add nsw <16 x i32> %wide.load14, %wide.load
  %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
  %21 = bitcast i32* %20 to <16 x i32>*
  store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1


However, I want it to emit the following IR:

  %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
  %1 = bitcast i32* %0 to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
  %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
  %9 = bitcast i32* %8 to <16 x i32>*
  %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
  %16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
  %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
  %21 = bitcast i32* %20 to <16 x i32>*
  store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1

so that I can offload the load, add, and store to the accelerator hardware. Is this possible? Do I need a separate pass to detect whether the loop accesses non-temporal data, or will Polly help here? What do you suggest?

Simon Pilgrim via llvm-dev

Jan 20, 2018, 1:26:26 PM
to hameeza ahmed, Craig Topper, llvm-dev
From C/C++ you just need to use the __builtin_nontemporal_store/__builtin_nontemporal_load builtins to tag the stores/loads with the nontemporal flag.

for(i=0;i<2048;i++) {
  __builtin_nontemporal_store( __builtin_nontemporal_load(b+i) + __builtin_nontemporal_load(c + i), a + i );
}

There may be an attribute you can tag pointers with instead, but I don't know of one offhand.

hameeza ahmed via llvm-dev

Jan 20, 2018, 1:29:46 PM
to Simon Pilgrim, llvm-dev
I have already seen the usage of __builtin_nontemporal_store, but I want to automate the identification of non-temporal loads and stores. I think I need to write a pass. Is it possible to detect non-temporal loops without Polly?

Hal Finkel via llvm-dev

Jan 21, 2018, 4:00:17 PM
to hameeza ahmed, Simon Pilgrim, llvm-dev


On 01/20/2018 12:29 PM, hameeza ahmed via llvm-dev wrote:
I have already seen the usage of __builtin_nontemporal_store, but I want to automate the identification of non-temporal loads and stores. I think I need to write a pass. Is it possible to detect non-temporal loops without Polly?

Yes, but we don't have anything that does that right now. The cost modeling is non-trivial, however. In the loop from your earlier message, which of those accesses would you expect to be nontemporal? Each of those arrays spans only 8 KB (2048 four-byte elements), and that's certainly smaller than many L1 caches. Turning those into nontemporal accesses could certainly lead to a performance regression for that loop, subsequent code, or both. If we do this more generally, I suspect that we'd need to split the loop so that small trip counts don't use nontemporal accesses at all and, for larger trip counts, we don't disturb data-reuse opportunities that would otherwise exist.

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

hameeza ahmed via llvm-dev

Jan 22, 2018, 4:26:38 PM
to Hal Finkel, llvm-dev
Thank you.

If I execute the same vector-sum code with a much larger number of iterations, such as 100000000000, will the non-temporal loads and stores be effective?