Actually, I am working on a vector accelerator that will execute the instructions marked as non-temporal.
For instance, if I have this loop:
for (i = 0; i < 2048; i++)
    a[i] = b[i] + c[i];
Currently it emits the following IR:
%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1
However, I want it to emit the following IR
%0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
%1 = bitcast i32* %0 to <16 x i32>*
%wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
%8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
%9 = bitcast i32* %8 to <16 x i32>*
%wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
%16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
%20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
%21 = bitcast i32* %20 to <16 x i32>*
store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1
so that I can offload the load, add, and store to the accelerator hardware. Is this possible? Do I need a separate pass to detect whether the loop operates on non-temporal data, or will Polly help here? What do you suggest?
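
One source-level route that may be worth trying before writing a pass, assuming Clang is the frontend: the builtins __builtin_nontemporal_load and __builtin_nontemporal_store attach !nontemporal metadata to the scalar load and store they generate. The sketch below rewrites the loop above with them and falls back to the plain loop on compilers that lack the builtins. The helper names fill_inputs, add_nontemporal and verify are hypothetical, chosen only for illustration. Whether the metadata is still present on the wide loads and stores after vectorization would need to be checked in the emitted IR. As far as the LangRef describes it, !nontemporal applies to load and store only, so the add would probably have to be identified some other way, for example by a pass that looks at the operands feeding a non-temporal store.

```c
#include <assert.h>

/* Older compilers may not provide __has_builtin at all. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#define N 2048

int a[N], b[N], c[N];

/* Hypothetical helper: fill the inputs with known values. */
void fill_inputs(void) {
    for (int i = 0; i < N; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }
}

/* Same loop as in the question, with each memory access marked
 * non-temporal when the Clang builtins are available. */
void add_nontemporal(void) {
    for (int i = 0; i < N; i++) {
#if __has_builtin(__builtin_nontemporal_load) && __has_builtin(__builtin_nontemporal_store)
        int sum = __builtin_nontemporal_load(&b[i]) +
                  __builtin_nontemporal_load(&c[i]);
        __builtin_nontemporal_store(sum, &a[i]);
#else
        /* Fallback: plain accesses, no non-temporal hint. */
        a[i] = b[i] + c[i];
#endif
    }
}

/* Hypothetical helper: check the result element by element. */
void verify(void) {
    for (int i = 0; i < N; i++)
        assert(a[i] == b[i] + c[i]);
}
```

Compiling this with something like clang -O2 -S -emit-llvm and searching the output for !nontemporal should show whether the hint reaches the vectorized IR.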