[llvm-dev] Non-Temporal hints from Loop Vectorizer

hameeza ahmed via llvm-dev

Jan 20, 2018, 12:44:28 PM

to llvm-dev, Craig Topper

Hello,

My work deals with non-temporal loads and stores. I found the non-temporal metadata in the LLVM documentation, but it does not show up in the IR.

How do I get the non-temporal metadata?

Simon Pilgrim via llvm-dev

Jan 20, 2018, 1:02:35 PM

to hameeza ahmed, llvm-dev, Craig Topper

llvm\test\CodeGen\X86\nontemporal-loads.ll shows how to create nt vector
loads in IR - is that what you're after?

Simon.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

hameeza ahmed via llvm-dev

Jan 20, 2018, 1:16:48 PM

to Simon Pilgrim, Craig Topper, llvm-dev

Actually, I am working on a vector accelerator that will execute the instructions that are non-temporal.

For instance, if I have this loop:

for (i = 0; i < 2048; i++)
    a[i] = b[i] + c[i];

it currently emits the following IR:

%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1

However, I want it to emit the following IR:

%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1

so that I can offload the load, add, and store to the accelerator hardware. Is that possible here? Do I need a separate pass to detect whether the loop has non-temporal data, or will Polly help here? What do you say?

Simon Pilgrim via llvm-dev

Jan 20, 2018, 1:26:26 PM

to hameeza ahmed, Craig Topper, llvm-dev

From C/C++ you just need to use the __builtin_nontemporal_store/__builtin_nontemporal_load builtins to tag the stores/loads with the nontemporal flag.

for (i = 0; i < 2048; i++) {
    __builtin_nontemporal_store(__builtin_nontemporal_load(b + i) + __builtin_nontemporal_load(c + i), a + i);
}

There may be an attribute you can tag pointers with instead but I don't know off hand.
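
[Editor's sketch] The builtin-based loop above can be wrapped into a self-contained function, roughly as follows. The function name vector_sum_nt and the NT_LOAD/NT_STORE macros are hypothetical names introduced here, not part of the thread. The fallback branch is an assumption added so that the file still compiles with compilers that do not provide Clang's non-temporal builtins.

```c
#include <stddef.h>

/* Use Clang's non-temporal builtins when the compiler reports them;
 * otherwise fall back to ordinary loads and stores (assumption: a plain
 * access is an acceptable substitute when the hint is unavailable). */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_load) && __has_builtin(__builtin_nontemporal_store)
#    define NT_LOAD(p)     __builtin_nontemporal_load(p)
#    define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#  endif
#endif
#ifndef NT_LOAD
#  define NT_LOAD(p)     (*(p))
#  define NT_STORE(v, p) (*(p) = (v))
#endif

/* a[i] = b[i] + c[i], with every load and store tagged as non-temporal
 * when the builtins are available. */
void vector_sum_nt(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        NT_STORE(NT_LOAD(b + i) + NT_LOAD(c + i), a + i);
}
```

Compiling this with something like `clang -O2 -S -emit-llvm` and searching the output for `!nontemporal` is one way to check whether the tagged loads and stores survive optimization. In real IR the tag should reference a metadata node containing a single i32 1.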

hameeza ahmed via llvm-dev

Jan 20, 2018, 1:29:46 PM

to Simon Pilgrim, llvm-dev

I have already seen the usage of __builtin_nontemporal_store, but I want to automate the identification of non-temporal loads/stores. I think I need to write a pass. Is it possible to detect non-temporal loops without Polly?

Hal Finkel via llvm-dev

Jan 21, 2018, 4:00:17 PM

to hameeza ahmed, Simon Pilgrim, llvm-dev

On 01/20/2018 12:29 PM, hameeza ahmed via llvm-dev wrote:

> I have already seen the usage of __builtin_nontemporal_store, but I want to automate the identification of non-temporal loads/stores. I think I need to write a pass. Is it possible to detect non-temporal loops without Polly?

Yes, but we don't have anything that does that right now. The cost modeling is non-trivial, however. In the loop below, which of those accesses would you expect to be nontemporal? All of those accesses span only 8 KB, and that's certainly smaller than many L1 caches. Turning those into nontemporal accesses could certainly lead to a performance regression for that loop, subsequent code, or both. If we do this more generally, I suspect that we'd need to split the loop so that small trip counts don't use them at all, and for larger trip counts, we don't disturb data-reuse opportunities that would otherwise exist.

-Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

hameeza ahmed via llvm-dev

Jan 22, 2018, 4:26:38 PM

to Hal Finkel, llvm-dev

Thank you.

If I execute the same vector sum code with a larger number of iterations, such as 100000000000, will the non-temporal loads and stores be effective?
