Another missed vectorization opportunity for struct Vector4 { float x, y, z, w; }; https://godbolt.org/z/lbAN92 // Bad: Failed to combine loads + split result Vector4 Add(const Vector4 &a, const Vector4 &b) { Vector4 r; r.x = a.x + b.x; r.y = a.y + b.y; r.z = a.z + b.z; r.w = a.w + b.w; return r; } _Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_ vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero vaddps %xmm2, %xmm0, %xmm0 vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero vaddps %xmm2, %xmm1, %xmm1 retq Not sure whether this can be handled in the SLP or DAG but we should be able to do something like: _Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_ vmovups (%rdi), %xmm0 vaddps (%rsi), %xmm0, %xmm0 vunpckhpd %xmm0, %xmm0, %xmm1 # xmm1 = xmm0[2,3,2,3] retq
Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG struct Vector2 { float x, y; }; struct Matrix22 { Vector2 x, y; }; // BAD: Fails to merge loads, fadds or stores void Add(Matrix22 &a, Matrix22 &b, Matrix22 &r) { r.x = Add(a.x, b.x); r.y = Add(a.y, b.y); } _Z3AddR8Matrix22S0_S0_: # @_Z3AddR8Matrix22S0_S0_ vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero vmovsd (%rsi), %xmm1 # xmm1 = mem[0],zero vaddps %xmm1, %xmm0, %xmm0 vmovlps %xmm0, (%rdx) vmovsd 8(%rdi), %xmm0 # xmm0 = mem[0],zero vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero vaddps %xmm1, %xmm0, %xmm0 vmovlps %xmm0, 8(%rdx) retq void Add_alt(Matrix22 &a, Matrix22 &b, Matrix22 &r) { Matrix22 c; c.x = Add(a.x, b.x); c.y = Add(a.y, b.y); r = c; } // NEARLY AS BAD: Fails to merge loads or fadds but merges stores _Z7Add_altR8Matrix22S0_S0_: # @_Z7Add_altR8Matrix22S0_S0_ vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero vaddps %xmm2, %xmm0, %xmm0 vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero vaddps %xmm2, %xmm1, %xmm1 vmovlhps %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0] vmovups %xmm0, (%rdx) retq
The root cause is in SLPVectorizer: struct {<2 x float>, <2 x float>} isn't recognized to be vectorizable as <4 x float>. Assigned to myself.
Here is the fix for vector part of this issue: https://reviews.llvm.org/D70068
Here is one of the fix to support matrix vectorization: https://reviews.llvm.org/D70924
(In reply to Anton Afanasyev from comment #4) > Here is one of the fix to support matrix vectorization: > https://reviews.llvm.org/D70924 Another part: https://reviews.llvm.org/D70587
Follow-up change to support matrix vectorization: https://reviews.llvm.org/D72689 . It provides revectorization of partially vectorized code.
Except first and last function, Clang trunk codegen looks worse and bigger than Clang 9.
Clang9: define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>, <2 x float>, <2 x float>, <2 x float>) local_unnamed_addr #1 { %5 = fadd <2 x float> %0, %2 %6 = fadd <2 x float> %1, %3 %7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %5, 0 %8 = insertvalue { <2 x float>, <2 x float> } %7, <2 x float> %6, 1 ret { <2 x float>, <2 x float> } %8 } Trunk: define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 { %5 = extractelement <2 x float> %0, i32 0 %6 = extractelement <2 x float> %2, i32 0 %7 = fadd float %5, %6 %8 = fadd <2 x float> %0, %2 %9 = insertelement <2 x float> %8, float %7, i32 0 %10 = extractelement <2 x float> %1, i32 0 %11 = extractelement <2 x float> %3, i32 0 %12 = fadd float %10, %11 %13 = fadd <2 x float> %1, %3 %14 = insertelement <2 x float> %13, float %12, i32 0 %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0 %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1 ret { <2 x float>, <2 x float> } %16 } Technically VectorCombiner could fix a lot of this, but we should work out whats caused it.
Hmm, I'm looking into this.
Can someone bisect to see where the code quality regressed between clang 9 and 10?
(In reply to Simon Pilgrim from comment #8) > define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float> > %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 { > %5 = extractelement <2 x float> %0, i32 0 > %6 = extractelement <2 x float> %2, i32 0 > %7 = fadd float %5, %6 > %8 = fadd <2 x float> %0, %2 > %9 = insertelement <2 x float> %8, float %7, i32 0 > %10 = extractelement <2 x float> %1, i32 0 > %11 = extractelement <2 x float> %3, i32 0 > %12 = fadd float %10, %11 > %13 = fadd <2 x float> %1, %3 > %14 = insertelement <2 x float> %13, float %12, i32 0 > %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0 > %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1 > ret { <2 x float>, <2 x float> } %16 > } > > Technically VectorCombiner could fix a lot of this, but we should work out > whats caused it. Similar to bug 45015, we should get the vector math ops back with: https://reviews.llvm.org/rG10ea01d80d6f ...and the backend can produce the same output as clang9 again on this example and the previous 1: Vector4 Add(const Vector4 a, const Vector4 b) { Vector4 r; r.x = a.x + b.x; r.y = a.y + b.y; r.z = a.z + b.z; r.w = a.w + b.w; return r; } ...but we need to enhance VectorCombine to get ideal IR, and VectorCombine is independent of any regressions between clang9 and clang10.
Yes, I've found that this regression is not related to the current PR and to the related commits (D70924, D70587, D72689).
I don't think we can block the release on this. Unblocking.
Some patches landed so… fixed?
(In reply to Simon Pilgrim from comment #1) > Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG Some of these matrix cases still need attention