SLP vectorization when performing scalar operations on vector elements #6618
Comments
I think that this should be handled by the BB vectorizer.
Nadav, any plans to handle this in the SLP vectorizer?
Yea, it should be easy to do. We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a bunch of inserts.
Is this actually fixed? A quick ToT check of either case didn't look any better.
Yes, I checked it and it works as required.
Reopening this. If we look at the equivalent C source then it appears that the vectorizer isn't working for non-128-bit vectors:
#include <x86intrin.h>
__m128 _vadd128(__m128 a, __m128 b) {
__m256 _vadd256(__m256 a, __m256 b) {
clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm
; Function Attrs: norecurse nounwind readnone ssp uwtable
; Function Attrs: norecurse nounwind readnone ssp uwtable
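For context, a minimal reconstruction of the truncated C source above might look like the sketch below. Only the #include, the two function signatures, and the clang invocation survive in the comment; the lane-wise bodies are assumptions added here for illustration.
#include <x86intrin.h>
/* Hypothetical bodies (not from the original report): lane-wise scalar
   adds that the SLP vectorizer is expected to turn into single vector
   fadds. The comment above reports that only the 128-bit case was
   being vectorized at the time. */
__m128 _vadd128(__m128 a, __m128 b) {
  __m128 c;
  c[0] = a[0] + b[0];
  c[1] = a[1] + b[1];
  c[2] = a[2] + b[2];
  c[3] = a[3] + b[3];
  return c;
}
__m256 _vadd256(__m256 a, __m256 b) {
  __m256 c;
  for (int i = 0; i < 8; ++i)
    c[i] = a[i] + b[i];
  return c;
}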
Must be fixed in r288412
r288412 was reverted in r288431, reopening.
I've added tests for this (128/256/512-bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492
Now, I think, it is finally fixed in r288497
Reverted again in r288508, reopening.
Fixed at rL289043 - it seems to have stuck this time.
Extended Description
In my automatically generated code it often happens that scalar operations are applied to vector elements even though they could equally have been written as vector operations. For example (due to modularization issues), I generate code like:
define <4 x float> @_vadd(<4 x float>, <4 x float>) {
%a0 = extractelement <4 x float> %0, i32 0
%b0 = extractelement <4 x float> %1, i32 0
%c0 = fadd float %a0, %b0
%a1 = extractelement <4 x float> %0, i32 1
%b1 = extractelement <4 x float> %1, i32 1
%c1 = fadd float %a1, %b1
%a2 = extractelement <4 x float> %0, i32 2
%b2 = extractelement <4 x float> %1, i32 2
%c2 = fadd float %a2, %b2
%a3 = extractelement <4 x float> %0, i32 3
%b3 = extractelement <4 x float> %1, i32 3
%c3 = fadd float %a3, %b3
%d0 = insertelement <4 x float> undef, float %c0, i32 0
%d1 = insertelement <4 x float> %d0, float %c1, i32 1
%d2 = insertelement <4 x float> %d1, float %c2, i32 2
%d3 = insertelement <4 x float> %d2, float %c3, i32 3
ret <4 x float> %d3
}
I think it would be both correct and more efficient for an optimization pass to swap the 'fadd's and 'extractelement's, which would yield:
define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
%c = fadd <4 x float> %0, %1
%c0 = extractelement <4 x float> %c, i32 0
%c1 = extractelement <4 x float> %c, i32 1
%c2 = extractelement <4 x float> %c, i32 2
%c3 = extractelement <4 x float> %c, i32 3
%d0 = insertelement <4 x float> undef, float %c0, i32 0
%d1 = insertelement <4 x float> %d0, float %c1, i32 1
%d2 = insertelement <4 x float> %d1, float %c2, i32 2
%d3 = insertelement <4 x float> %d2, float %c3, i32 3
ret <4 x float> %d3
}
That the remaining extractelements and insertelements form an identity transform is already correctly detected by both the optimizer and the (X86) code generator; the optimizer transforms the last piece of code into something like:
define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
%c = fadd <4 x float> %0, %1
ret <4 x float> %c
}
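For readers who would rather reproduce this from C than from IR, a small source-level equivalent of @_vadd might look like the sketch below; the type name float4 and the function name vadd are illustrative assumptions, not part of the original report.
typedef float float4 __attribute__((vector_size(16)));
/* Lane-wise scalar adds mirroring the extract/fadd/insert pattern in the
   IR above; once SLP vectorization kicks in, this should collapse to a
   single fadd <4 x float>, matching the final IR shown above. */
float4 vadd(float4 a, float4 b) {
  float4 d;
  d[0] = a[0] + b[0];
  d[1] = a[1] + b[1];
  d[2] = a[2] + b[2];
  d[3] = a[3] + b[3];
  return d;
}
Compiling this at -O3 with clang, or running the IR above through opt with the SLP vectorizer pass, is one way to check whether the transform discussed in the comments has taken effect.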