8 x UNPCKLPS/UNPCKHPS
4 x SHUFPS
8 x BLENDPS
4 x INSERTF128
4 x PERM2F128
> I compile this further with
>
> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
> -mcpu=haswell - -o -
>
> to obtain:
>
> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
and after the x86 shuffle combines:
8 x UNPCKLPS/UNPCKHPS
8 x UNPCKLPD/UNPCKHPD
4 x INSERTF128
4 x PERM2F128
Starting from each BLENDPS, the combiner has folded it with the SHUFPS
to create the UNPCK*PD nodes. We nearly always benefit from folding
shuffle chains to reduce total instruction counts, even if some inner
nodes have multiple uses (like the SHUFPS), and I'd hate to lose that.
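
To see why the fold is legal, the lane-wise semantics of the instructions can be simulated. This is only a minimal Python sketch, not LLVM code. The immediates 0x4E, 0xCC and 0x33 are an assumption about what the SHUFPS/BLENDPS groups in the posted IR look like (one common blend-based formulation), since the IR itself is not reproduced in this message.

```python
# Model a 256-bit AVX register as a list of 8 element labels.
# Each instruction below works independently on the two 128-bit lanes.

def shufps(a, b, imm):
    """VSHUFPS ymm: per lane, two elements picked from a, then two from b."""
    out = []
    for lane in (0, 4):
        out += [a[lane + (imm & 3)],
                a[lane + ((imm >> 2) & 3)],
                b[lane + ((imm >> 4) & 3)],
                b[lane + ((imm >> 6) & 3)]]
    return out

def blendps(a, b, imm):
    """VBLENDPS ymm: element i comes from b if bit i of imm is set, else a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def unpcklpd(a, b):
    """VUNPCKLPD ymm: per lane, low 64 bits of a, then low 64 bits of b."""
    return a[0:2] + b[0:2] + a[4:6] + b[4:6]

def unpckhpd(a, b):
    """VUNPCKHPD ymm: per lane, high 64 bits of a, then high 64 bits of b."""
    return a[2:4] + b[2:4] + a[6:8] + b[6:8]

a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

# Assumed blend-based form: one SHUFPS shared by two BLENDPS.
v = shufps(a, b, 0x4E)     # per lane: a2 a3 b0 b1
lo = blendps(a, v, 0xCC)   # per lane: a0 a1 b0 b1
hi = blendps(b, v, 0x33)   # per lane: a2 a3 b2 b3

# Each BLENDPS-of-SHUFPS result is exactly one UNPCK*PD of the inputs.
folds_to_unpck = (lo == unpcklpd(a, b)) and (hi == unpckhpd(a, b))
print(folds_to_unpck)
```

Under that assumption, every BLENDPS result is expressible as a single two-input shuffle of the original operands, which is the kind of chain the combiner folds.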
> At this point, I would expect to see some vblendps instructions
> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
> ports 0 and 1). However the expected instruction does not get
> generated and llvm-mca continues to show me high port 5 contention.
>
> Could people suggest some steps / commands to help better understand
> why my expectation is not met and whether I can do something to make
> the compiler generate what I want? Thanks in advance!
So on Haswell, we've gained 4 extra Port5-only shuffles but removed the
8 Port015 blends.
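
The arithmetic behind that statement can be written out as a short tally. This is a rough Python sketch using only the instruction counts from the two listings in this thread. The port assignment is an assumption (register-register forms on Haswell, every shuffle restricted to port 5, BLENDPS free to issue on ports 0, 1 or 5), and the helper name `tally` is hypothetical.

```python
# Instruction counts copied from the two listings in this thread.
before = {"UNPCK*PS": 8, "SHUFPS": 4, "BLENDPS": 8,
          "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCK*PS": 8, "UNPCK*PD": 8,
         "INSERTF128": 4, "PERM2F128": 4}

# Assumption: on Haswell these all need port 5 (register-register forms),
# while BLENDPS may issue on port 0, 1 or 5.
PORT5_ONLY = {"UNPCK*PS", "UNPCK*PD", "SHUFPS", "INSERTF128", "PERM2F128"}

def tally(counts):
    """Return (port-5-only ops, ops that can also use ports 0/1)."""
    p5 = sum(n for op, n in counts.items() if op in PORT5_ONLY)
    flexible = sum(n for op, n in counts.items() if op not in PORT5_ONLY)
    return p5, flexible

p5_before, flex_before = tally(before)
p5_after, flex_after = tally(after)

print("before:", p5_before, "port-5-only,", flex_before, "flexible")
print("after: ", p5_after, "port-5-only,", flex_after, "flexible")
print("extra port-5-only ops:", p5_after - p5_before)
```

The total instruction count drops from 28 to 24, but every remaining instruction competes for port 5 under the assumed port model.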
We have very few arch-specific shuffle combines, just the
fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
loads; the shuffle combiner just aims to reduce the number of simple
target shuffle nodes. And tbh I'm reluctant to add to this, as shuffle
combining is complex already.
We should be preferring to lower/combine to BLENDPS in more
circumstances (it's commutable and never slower than any other target
shuffle, although demanded elts can do less with 'undef' elements), but
that won't help us here.
So far I've failed to find a BLEND-based 8x8 transpose pattern that the
shuffle combiner doesn't manage to combine back to the 8xUNPCK/SHUFPS ops :(
> I have verified independently that in isolation, a single such shuffle
> creates a vblendps. I see them being recombined in the produced
> assembly, and I would like to experiment with preventing vshufps
> + vblendps + vblendps from being recombined into vunpckxxx + vunpckxxx
> instructions.
>
> --
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
The only thing I can think of is you might want to see if you can
move the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
the SHUFPS/BLENDPS:
8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS
Separating the per-lane shuffles with the subvector shuffles could help
stop the shuffle combiner from folding them.
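
As a sanity check that the reordered sequence still computes an 8x8 transpose, here is a minimal Python simulation of the data flow. It says nothing about whether the combiner would actually leave the result alone. The SHUFPS/BLENDPS immediates (0x4E, 0xCC, 0x33), the PERM2F128 immediate 0x31, and the helper names `join_low`/`join_high` are assumptions chosen for illustration, not taken from the posted IR.

```python
# Model a 256-bit register as 8 elements; element (r, c) is row r, column c.

def unpcklps(a, b):
    """VUNPCKLPS ymm: per lane, interleave the low two elements of a and b."""
    out = []
    for l in (0, 4):
        out += [a[l], b[l], a[l + 1], b[l + 1]]
    return out

def unpckhps(a, b):
    """VUNPCKHPS ymm: per lane, interleave the high two elements of a and b."""
    out = []
    for l in (0, 4):
        out += [a[l + 2], b[l + 2], a[l + 3], b[l + 3]]
    return out

def join_low(a, b):
    """INSERTF128-style: low half of a, then low half of b."""
    return a[0:4] + b[0:4]

def join_high(a, b):
    """PERM2F128 with assumed imm 0x31: high half of a, then high half of b."""
    return a[4:8] + b[4:8]

def shufps(a, b, imm):
    out = []
    for l in (0, 4):
        out += [a[l + (imm & 3)], a[l + ((imm >> 2) & 3)],
                b[l + ((imm >> 4) & 3)], b[l + ((imm >> 6) & 3)]]
    return out

def blendps(a, b, imm):
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

rows = [[(r, c) for c in range(8)] for r in range(8)]

# Step 1: 8 x UNPCK*PS on adjacent row pairs.
t = []
for r in range(0, 8, 2):
    t.append(unpcklps(rows[r], rows[r + 1]))
    t.append(unpckhps(rows[r], rows[r + 1]))

# Step 2: 4 x INSERTF128 + 4 x PERM2F128, moved ahead of the blends.
lo_half = [join_low(t[i], t[i + 4]) for i in range(4)]
hi_half = [join_high(t[i], t[i + 4]) for i in range(4)]

# Step 3: 4 x SHUFPS + 8 x BLENDPS produce the final columns.
out = []
for half in (lo_half, hi_half):
    for i in (0, 1):
        x, y = half[i], half[i + 2]
        v = shufps(x, y, 0x4E)
        out.append(blendps(x, v, 0xCC))
        out.append(blendps(y, v, 0x33))

expected = [[(r, c) for r in range(8)] for c in range(8)]
is_transpose = out == expected
print(is_transpose)
```

The instruction counts in the simulation match the list above (8, 4, 4, 4, 8), so the reordering does not by itself add any work.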
>As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be conceivable to either:
>1. have a way to inject some numeric cost to influence the value of some resulting combinations?
>2. revive some form of intrinsic and guarantee that the instruction would be generated?
I think a feasible way is to add a new tuningXXX feature for the given targets and have the combine behave differently when the flag is set.
1) seems over-engineered and 2) seems like overkill given the potential opportunities for the combine.
Thanks
Phoebe
Nicolas - have you investigated just using inline asm instead?
Roman