Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float> #41367

Open
RKSimon opened this issue May 26, 2019 · 16 comments
Assignees
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@RKSimon
Copy link
Collaborator

RKSimon commented May 26, 2019

Bugzilla Link 42022
Version trunk
OS Windows NT
CC @alexey-bataev,@anton-afanasyev,@topperc,@davidbolvansky,@zmodem,@RKSimon,@rotateright

Extended Description

Another missed vectorization opportunity for

struct Vector4 {
    float x, y, z, w;
};

https://godbolt.org/z/lbAN92

// Bad: Failed to combine loads + split result
Vector4 Add(const Vector4 &a, const Vector4 &b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}
_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  retq

Not sure whether this can be handled in the SLP or DAG but we should be able to do something like:

_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovups (%rdi), %xmm0
  vaddps (%rsi), %xmm0, %xmm0
  vunpckhpd %xmm0, %xmm0, %xmm1 # xmm1 = xmm0[2,3,2,3]
  retq
@RKSimon
Copy link
Collaborator Author

RKSimon commented May 26, 2019

assigned to @anton-afanasyev

@RKSimon
Copy link
Collaborator Author

RKSimon commented May 26, 2019

Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

struct Vector2 {
    float x, y;
};

struct Matrix22 {
    Vector2 x, y;
};

// BAD: Fails to merge loads, fadds or stores
void Add(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    r.x = Add(a.x, b.x);
    r.y = Add(a.y, b.y);
}
_Z3AddR8Matrix22S0_S0_: # @_Z3AddR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, (%rdx)
  vmovsd 8(%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, 8(%rdx)
  retq
void Add_alt(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    Matrix22 c;
    c.x = Add(a.x, b.x);
    c.y = Add(a.y, b.y);
    r = c;
}
// NEARLY AS BAD: Fails to merge loads or fadds but merges stores
_Z7Add_altR8Matrix22S0_S0_: # @_Z7Add_altR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  vmovlhps %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
  vmovups %xmm0, (%rdx)
  retq

@anton-afanasyev
Copy link
Contributor

The root cause is in SLPVectorizer: struct {<2 x float>, <2 x float>} isn't recognized to be vectorizable as <4 x float>. Assigned to myself.

@anton-afanasyev
Copy link
Contributor

Here is the fix for vector part of this issue: https://reviews.llvm.org/D70068

@anton-afanasyev
Copy link
Contributor

Here is one of the fix to support matrix vectorization: https://reviews.llvm.org/D70924

@anton-afanasyev
Copy link
Contributor

Here is one of the fix to support matrix vectorization:
https://reviews.llvm.org/D70924

Another part: https://reviews.llvm.org/D70587

@anton-afanasyev
Copy link
Contributor

Follow-up change to support matrix vectorization: https://reviews.llvm.org/D72689 . It provides revectorization of partially vectorized code.

@davidbolvansky
Copy link
Collaborator

Except first and last function, Clang trunk codegen looks worse and bigger than Clang 9.

@RKSimon
Copy link
Collaborator Author

RKSimon commented Feb 21, 2020

Clang9:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>, <2 x float>, <2 x float>, <2 x float>) local_unnamed_addr #1 {
  %5 = fadd <2 x float> %0, %2
  %6 = fadd <2 x float> %1, %3
  %7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %5, 0
  %8 = insertvalue { <2 x float>, <2 x float> } %7, <2 x float> %6, 1
  ret { <2 x float>, <2 x float> } %8
}

Trunk:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
  %5 = extractelement <2 x float> %0, i32 0
  %6 = extractelement <2 x float> %2, i32 0
  %7 = fadd float %5, %6
  %8 = fadd <2 x float> %0, %2
  %9 = insertelement <2 x float> %8, float %7, i32 0
  %10 = extractelement <2 x float> %1, i32 0
  %11 = extractelement <2 x float> %3, i32 0
  %12 = fadd float %10, %11
  %13 = fadd <2 x float> %1, %3
  %14 = insertelement <2 x float> %13, float %12, i32 0
  %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
  %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
  ret { <2 x float>, <2 x float> } %16
}

Technically VectorCombiner could fix a lot of this, but we should work out whats caused it.

@anton-afanasyev
Copy link
Contributor

Hmm, I'm looking into this.

@zmodem
Copy link
Collaborator

zmodem commented Feb 25, 2020

Can someone bisect to see where the code quality regressed between clang 9 and 10?

@rotateright
Copy link
Contributor

rotateright commented Feb 25, 2020

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>
%0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
  %5 = extractelement <2 x float> %0, i32 0
  %6 = extractelement <2 x float> %2, i32 0
  %7 = fadd float %5, %6
  %8 = fadd <2 x float> %0, %2
  %9 = insertelement <2 x float> %8, float %7, i32 0
  %10 = extractelement <2 x float> %1, i32 0
  %11 = extractelement <2 x float> %3, i32 0
  %12 = fadd float %10, %11
  %13 = fadd <2 x float> %1, %3
  %14 = insertelement <2 x float> %13, float %12, i32 0
  %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
  %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
  ret { <2 x float>, <2 x float> } %16
}

Technically VectorCombiner could fix a lot of this, but we should work out
whats caused it.

Similar to bug 45015, we should get the vector math ops back with:
https://reviews.llvm.org/rG10ea01d80d6f
...and the backend can produce the same output as clang9 again on this example and the previous 1:

Vector4 Add(const Vector4 a, const Vector4 b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

...but we need to enhance VectorCombine to get ideal IR, and VectorCombine is independent of any regressions between clang9 and clang10.

@anton-afanasyev
Copy link
Contributor

Yes, I've found that this regression is not related to the current PR and to the related commits (D70924, D70587, D72689).

@zmodem
Copy link
Collaborator

zmodem commented Feb 26, 2020

I don't think we can block the release on this. Unblocking.

@davidbolvansky
Copy link
Collaborator

Some patches landed so… fixed?

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jul 6, 2021

Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

Some of these matrix cases still need attention

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

5 participants