[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

Hi all,

Following up on the discussion in the last meeting about
auto-vectorization for the RISC-V Vector extension (scalable vectors) at
the Barcelona Supercomputing Center, here are some additional details.

We have a working prototype for end-to-end compilation targeting the
RISC-V Vector extension. The auto-vectorizer supports two strategies to
generate LLVM IR using scalable vectors:

1) Generate a vector loop using VF (vscale x k) = whole vector register
width, followed by a scalar tail loop.

2) Generate only a vector loop with the active vector length controlled
by the RISC-V `vsetvli` instruction and using Vector Predicated
intrinsics (https://reviews.llvm.org/D57504). (Of course, intrinsics
come with their own limitations, but we feel they serve as a good proof
of concept for our use case.) We also extend VPlan to generate
VPInstructions that are expanded using predicated intrinsics.

We also considered a third hybrid approach of having a vector loop with
VF = whole register width, followed by a vector tail loop using
predicated intrinsics. For now though, based on project requirements,
we favoured the second approach.
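
As an illustration of the difference between the first two strategies, here is a minimal Python sketch that simulates both loop shapes on an element-wise add. It is not taken from the prototype: the function names are hypothetical, and `min(vlmax, n - i)` is only an assumed stand-in for the vector length that `vsetvli` would grant on each trip.

```python
def add_with_scalar_tail(a, b, vf):
    """Strategy 1: full-width vector body, then a scalar remainder loop."""
    n = len(a)
    out = [0] * n
    i = 0
    vector_iters = 0
    while i + vf <= n:                 # vector body: always processes vf lanes
        out[i:i + vf] = [x + y for x, y in zip(a[i:i + vf], b[i:i + vf])]
        i += vf
        vector_iters += 1
    scalar_iters = 0
    while i < n:                       # scalar tail: one element per iteration
        out[i] = a[i] + b[i]
        i += 1
        scalar_iters += 1
    return out, vector_iters, scalar_iters


def add_with_active_vector_length(a, b, vlmax):
    """Strategy 2: a single vector loop; each trip handles vl elements."""
    n = len(a)
    out = [0] * n
    i = 0
    vls = []
    while i < n:
        vl = min(vlmax, n - i)         # assumption: models the vsetvli result
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        vls.append(vl)
        i += vl
    return out, vls


a = list(range(10))
b = [10 * x for x in a]
tail_result = add_with_scalar_tail(a, b, vf=4)
evl_result = add_with_active_vector_length(a, b, vlmax=4)
```

Both functions compute the same result; the second one simply absorbs the remainder into a final, shorter vector trip instead of a scalar loop.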

We have also taken care to not break any fixed-vector implementation.
All the scalable vector IR gen is guarded by conditions set by TTI. 

For shuffles, the most used case is broadcast, which is supported by the
current semantics of the `shufflevector` instruction. For other cases
like reverse, concat, etc., we have defined our own intrinsics.

Current limitations:
The cost model for scalable vectors does little beyond always deciding
to vectorize with a VF based on TargetWidestType/SmallestType.
We also do not support interleaving yet.

Demo:
The current implementation is very much in alpha and eventually, once
it's more polished and thoroughly verified, we will put out patches on
Phabricator. Till then, we have set up a Compiler Explorer server
against our development branch to showcase the generated code.

You can see and experiment with the generated LLVM IR and VPlan for a
set of examples, with a predicated vector loop
(`-mprefer-predicate-over-epilog`) at
https://repo.hca.bsc.es/epic/z/JB4ZoJ
and with a scalar epilog (`-mno-prefer-predicate-over-epilog`) at
https://repo.hca.bsc.es/epic/z/0WoDGt.
Note that you can remove the `-emit-llvm` option to see the generated
RISC-V assembly.

We welcome any questions and feedback.

Thanks and Regards,
Vineet Kumar - vineet...@bsc.es
Barcelona Supercomputing Center - Centro Nacional de Supercomputación

  ; Call the target hook to let the target select %mask and %evl params for the loop header
  %evl, %mask <- IRBuilder.createIterationPredicate(%i, %n, TTI)

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;   %mask = icmp (%i + (seq <8 x i32> 0,1,2,...)), %n
  ;   %evl = i32 8

  ; Configure the Vector Predication builder to use those
  VPBuilder
      .setExplicitVectorLength(%evl)
      .setMask(%mask);

  ; Start building vector-predicated instructions
  VPBuilder.createFadd(%x, %y)    ; --> call @llvm.vp.fadd(%x, %y, %mask, %evl)
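
To make the per-target examples above concrete, here is a small Python model of one loop trip. It is only a sketch under stated assumptions: `iteration_predicate`, `enabled_lanes` and the `"evl"` / `"mask"` style names are hypothetical, the lane mask is modelled as `base + lane < n` (ignoring overflow), and `%evl = call @llvm.vscale()` is read as shorthand for "the full vector length".

```python
def active_lane_mask(base, n, lanes):
    """Lane l is on when base + l < n (modelled on get.active.lane.mask)."""
    return [base + l < n for l in range(lanes)]


def iteration_predicate(style, i, n, lanes):
    """Return (mask, evl) for the trip starting at index i."""
    if style == "evl":
        # RISC-V V / VE style: all-ones mask, the EVL limits the trip.
        return [True] * lanes, min(lanes, n - i)
    if style == "mask":
        # MVE / SVE / AVX style: the mask limits the trip, EVL is full length.
        return active_lane_mask(i, n, lanes), lanes
    raise ValueError(f"unknown predicate style: {style}")


def enabled_lanes(mask, evl):
    """A lane executes only if its mask bit is set and its index is below evl."""
    return [bit and l < evl for l, bit in enumerate(mask)]


# Last trip of a 10-element loop on a 4-lane machine, in both styles.
evl_style = enabled_lanes(*iteration_predicate("evl", 8, 10, 4))
mask_style = enabled_lanes(*iteration_predicate("mask", 8, 10, 4))
```

In this model both styles enable the same lanes on every trip; they differ only in whether the mask or the EVL carries the information.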
Hi Simon

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()

I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am wondering if it could be something like:

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = call @llvm.vscale(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale(... ,..)

Cheers,
Sjoerd.

  llvm.vp.fadd nxv4f128(%x, %y, %mask, (@llvm.vscale() * 4))

 (VPIntrinsic::canIgnoreVectorLengthParam()).
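
One reading of the fragment above is that an EVL equal to the full lane count (`vscale * 4` for a `<vscale x 4 x ...>` type) disables no lanes, so a target without a hardware vector length can drop it. The Python sketch below models only that idea; the helper names are hypothetical and it does not reproduce how LLVM's `canIgnoreVectorLengthParam()` inspects the IR.

```python
def can_ignore_evl(evl, vscale, min_lanes):
    """True when evl covers every lane of a <vscale x min_lanes x ...> vector."""
    return evl >= vscale * min_lanes


def enabled_lanes(mask, evl):
    """A lane executes only if its mask bit is set and its index is below evl."""
    return [bit and l < evl for l, bit in enumerate(mask)]


# Assumed example: vscale = 2 at run time, so an nxv4 vector has 8 lanes.
mask = [True, False, True, True, False, True, True, True]
full_evl = 2 * 4
unchanged = enabled_lanes(mask, full_evl) == mask
```

When the check holds, the mask alone decides which lanes run, which is why only a shorter EVL needs target support.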

A get.active.lane.mask call, for example get.mask(%i, 0), can trivially be expanded/lowered to a (splat i1 1). This is not terribly important, but I think it shows that get.active.lane.mask could be used for all targets; we don't need many cases. Similarly, vscale can be a no-op or do something.

Cheers,
Sjoerd.
