Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

RKSimon · 2019-05-26T12:12:34Z


Bugzilla Link	42022
Version	trunk
OS	Windows NT
CC	@alexey-bataev,@anton-afanasyev,@topperc,@davidbolvansky,@zmodem,@RKSimon,@rotateright

Extended Description

Another missed vectorization opportunity for

struct Vector4 {
    float x, y, z, w;
};

https://godbolt.org/z/lbAN92

// Bad: Failed to combine loads + split result
Vector4 Add(const Vector4 &a, const Vector4 &b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  retq

Not sure whether this can be handled in the SLP or DAG but we should be able to do something like:

_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovups (%rdi), %xmm0
  vaddps (%rsi), %xmm0, %xmm0
  vunpckhpd %xmm0, %xmm0, %xmm1 # xmm1 = xmm0[2,3,2,3]
  retq

The text was updated successfully, but these errors were encountered:

RKSimon · 2019-05-26T12:12:34Z

assigned to @anton-afanasyev

RKSimon · 2019-05-26T12:18:13Z

Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

struct Vector2 {
    float x, y;
};

struct Matrix22 {
    Vector2 x, y;
};

// BAD: Fails to merge loads, fadds or stores
void Add(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    r.x = Add(a.x, b.x);
    r.y = Add(a.y, b.y);
}

_Z3AddR8Matrix22S0_S0_: # @_Z3AddR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, (%rdx)
  vmovsd 8(%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, 8(%rdx)
  retq

void Add_alt(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    Matrix22 c;
    c.x = Add(a.x, b.x);
    c.y = Add(a.y, b.y);
    r = c;
}

// NEARLY AS BAD: Fails to merge loads or fadds but merges stores
_Z7Add_altR8Matrix22S0_S0_: # @_Z7Add_altR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  vmovlhps %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
  vmovups %xmm0, (%rdx)
  retq

anton-afanasyev · 2019-07-05T17:52:22Z

The root cause is in SLPVectorizer: struct {<2 x float>, <2 x float>} isn't recognized to be vectorizable as <4 x float>. Assigned to myself.

anton-afanasyev · 2019-11-14T07:30:16Z

Here is the fix for vector part of this issue: https://reviews.llvm.org/D70068

anton-afanasyev · 2019-12-02T22:06:28Z

Here is one of the fix to support matrix vectorization: https://reviews.llvm.org/D70924

anton-afanasyev · 2019-12-02T22:08:11Z

Here is one of the fix to support matrix vectorization:
https://reviews.llvm.org/D70924

Another part: https://reviews.llvm.org/D70587

anton-afanasyev · 2020-01-15T08:12:57Z

Follow-up change to support matrix vectorization: https://reviews.llvm.org/D72689 . It provides revectorization of partially vectorized code.

davidbolvansky · 2020-02-21T12:23:26Z

Except first and last function, Clang trunk codegen looks worse and bigger than Clang 9.

RKSimon · 2020-02-21T15:07:32Z

Clang9:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>, <2 x float>, <2 x float>, <2 x float>) local_unnamed_addr #1 {
  %5 = fadd <2 x float> %0, %2
  %6 = fadd <2 x float> %1, %3
  %7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %5, 0
  %8 = insertvalue { <2 x float>, <2 x float> } %7, <2 x float> %6, 1
  ret { <2 x float>, <2 x float> } %8
}

Trunk:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
  %5 = extractelement <2 x float> %0, i32 0
  %6 = extractelement <2 x float> %2, i32 0
  %7 = fadd float %5, %6
  %8 = fadd <2 x float> %0, %2
  %9 = insertelement <2 x float> %8, float %7, i32 0
  %10 = extractelement <2 x float> %1, i32 0
  %11 = extractelement <2 x float> %3, i32 0
  %12 = fadd float %10, %11
  %13 = fadd <2 x float> %1, %3
  %14 = insertelement <2 x float> %13, float %12, i32 0
  %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
  %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
  ret { <2 x float>, <2 x float> } %16
}

Technically VectorCombiner could fix a lot of this, but we should work out whats caused it.

anton-afanasyev · 2020-02-21T19:42:20Z

Hmm, I'm looking into this.

zmodem · 2020-02-25T14:29:24Z

Can someone bisect to see where the code quality regressed between clang 9 and 10?

rotateright · 2020-02-25T15:22:55Z

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>
%0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
  %5 = extractelement <2 x float> %0, i32 0
  %6 = extractelement <2 x float> %2, i32 0
  %7 = fadd float %5, %6
  %8 = fadd <2 x float> %0, %2
  %9 = insertelement <2 x float> %8, float %7, i32 0
  %10 = extractelement <2 x float> %1, i32 0
  %11 = extractelement <2 x float> %3, i32 0
  %12 = fadd float %10, %11
  %13 = fadd <2 x float> %1, %3
  %14 = insertelement <2 x float> %13, float %12, i32 0
  %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
  %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
  ret { <2 x float>, <2 x float> } %16
}

Technically VectorCombiner could fix a lot of this, but we should work out
whats caused it.

Similar to bug 45015, we should get the vector math ops back with:
https://reviews.llvm.org/rG10ea01d80d6f
...and the backend can produce the same output as clang9 again on this example and the previous 1:

Vector4 Add(const Vector4 a, const Vector4 b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

...but we need to enhance VectorCombine to get ideal IR, and VectorCombine is independent of any regressions between clang9 and clang10.

anton-afanasyev · 2020-02-25T16:07:03Z

Yes, I've found that this regression is not related to the current PR and to the related commits (D70924, D70587, D72689).

zmodem · 2020-02-26T15:23:25Z

I don't think we can block the release on this. Unblocking.

davidbolvansky · 2021-07-06T14:00:50Z

Some patches landed so… fixed?

RKSimon · 2021-07-06T14:08:51Z

Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

Some of these matrix cases still need attention

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

RKSimon commented May 26, 2019 •

edited

Loading

RKSimon commented May 26, 2019

RKSimon commented May 26, 2019 •

edited

Loading

anton-afanasyev commented Jul 5, 2019

anton-afanasyev commented Nov 14, 2019

anton-afanasyev commented Dec 2, 2019

anton-afanasyev commented Dec 2, 2019

anton-afanasyev commented Jan 15, 2020

davidbolvansky commented Feb 21, 2020

RKSimon commented Feb 21, 2020 •

edited

Loading

anton-afanasyev commented Feb 21, 2020

zmodem commented Feb 25, 2020

rotateright commented Feb 25, 2020 •

edited by RKSimon

Loading

anton-afanasyev commented Feb 25, 2020

zmodem commented Feb 26, 2020

davidbolvansky commented Jul 6, 2021

RKSimon commented Jul 6, 2021

Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

Comments

RKSimon commented May 26, 2019 • edited Loading

Extended Description

RKSimon commented May 26, 2019

RKSimon commented May 26, 2019 • edited Loading

anton-afanasyev commented Jul 5, 2019

anton-afanasyev commented Nov 14, 2019

anton-afanasyev commented Dec 2, 2019

anton-afanasyev commented Dec 2, 2019

anton-afanasyev commented Jan 15, 2020

davidbolvansky commented Feb 21, 2020

RKSimon commented Feb 21, 2020 • edited Loading

anton-afanasyev commented Feb 21, 2020

zmodem commented Feb 25, 2020

rotateright commented Feb 25, 2020 • edited by RKSimon Loading

anton-afanasyev commented Feb 25, 2020

zmodem commented Feb 26, 2020

davidbolvansky commented Jul 6, 2021

RKSimon commented Jul 6, 2021

RKSimon commented May 26, 2019 •

edited

Loading

RKSimon commented May 26, 2019 •

edited

Loading

RKSimon commented Feb 21, 2020 •

edited

Loading

rotateright commented Feb 25, 2020 •

edited by RKSimon

Loading