42022 – Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float>

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 42022 - Failure to merge 2 * <2 x float> load + fadd and then split the result back to 2 * <2 x float>

Summary: Failure to merge 2 * <2 x float> load + fadd and then split the result back t...

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86 (show other bugs)
Version:	trunk
Hardware:	PC Windows NT

Importance:	P enhancement
Assignee:	Anton Afanasyev

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-05-26 05:12 PDT by Simon Pilgrim
Modified:	2021-07-06 07:08 PDT (History)
CC List:	9 users (show)

See Also:	41892 45015
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Simon Pilgrim 2019-05-26 05:12:34 PDT

Another missed vectorization opportunity for

struct Vector4 {
    float x, y, z, w;
};

https://godbolt.org/z/lbAN92

// Bad: Failed to combine loads + split result
Vector4 Add(const Vector4 &a, const Vector4 &b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  retq

Not sure whether this can be handled in the SLP or DAG but we should be able to do something like:

_Z3AddRK7Vector4S1_: # @_Z3AddRK7Vector4S1_
  vmovups (%rdi), %xmm0
  vaddps (%rsi), %xmm0, %xmm0
  vunpckhpd %xmm0, %xmm0, %xmm1 # xmm1 = xmm0[2,3,2,3]
  retq

Comment 1 Simon Pilgrim 2019-05-26 05:18:13 PDT

Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

struct Vector2 {
    float x, y;
};

struct Matrix22 {
    Vector2 x, y;
};

// BAD: Fails to merge loads, fadds or stores
void Add(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    r.x = Add(a.x, b.x);
    r.y = Add(a.y, b.y);
}

_Z3AddR8Matrix22S0_S0_: # @_Z3AddR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, (%rdx)
  vmovsd 8(%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm1, %xmm0, %xmm0
  vmovlps %xmm0, 8(%rdx)
  retq

void Add_alt(Matrix22 &a, Matrix22 &b, Matrix22 &r) {
    Matrix22 c;
    c.x = Add(a.x, b.x);
    c.y = Add(a.y, b.y);
    r = c;
}

// NEARLY AS BAD: Fails to merge loads or fadds but merges stores
_Z7Add_altR8Matrix22S0_S0_: # @_Z7Add_altR8Matrix22S0_S0_
  vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
  vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
  vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
  vaddps %xmm2, %xmm0, %xmm0
  vmovsd 8(%rsi), %xmm2 # xmm2 = mem[0],zero
  vaddps %xmm2, %xmm1, %xmm1
  vmovlhps %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
  vmovups %xmm0, (%rdx)
  retq

Comment 2 Anton Afanasyev 2019-07-05 10:52:22 PDT

The root cause is in SLPVectorizer: struct {<2 x float>, <2 x float>} isn't recognized to be vectorizable as <4 x float>. Assigned to myself.

Comment 3 Anton Afanasyev 2019-11-13 23:30:16 PST

Here is the fix for vector part of this issue: https://reviews.llvm.org/D70068

Comment 4 Anton Afanasyev 2019-12-02 14:06:28 PST

Here is one of the fix to support matrix vectorization: https://reviews.llvm.org/D70924

Comment 5 Anton Afanasyev 2019-12-02 14:08:11 PST

(In reply to Anton Afanasyev from comment #4)
> Here is one of the fix to support matrix vectorization:
> https://reviews.llvm.org/D70924

Another part: https://reviews.llvm.org/D70587

Comment 6 Anton Afanasyev 2020-01-15 00:12:57 PST

Follow-up change to support matrix vectorization: https://reviews.llvm.org/D72689 . It provides revectorization of partially vectorized code.

Comment 7 David Bolvansky 2020-02-21 04:23:26 PST

Except first and last function, Clang trunk codegen looks worse and bigger than Clang 9.

Comment 8 Simon Pilgrim 2020-02-21 07:07:32 PST

Clang9:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>, <2 x float>, <2 x float>, <2 x float>) local_unnamed_addr #1 {
  %5 = fadd <2 x float> %0, %2
  %6 = fadd <2 x float> %1, %3
  %7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %5, 0
  %8 = insertvalue { <2 x float>, <2 x float> } %7, <2 x float> %6, 1
  ret { <2 x float>, <2 x float> } %8
}

Trunk:

define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
  %5 = extractelement <2 x float> %0, i32 0
  %6 = extractelement <2 x float> %2, i32 0
  %7 = fadd float %5, %6
  %8 = fadd <2 x float> %0, %2
  %9 = insertelement <2 x float> %8, float %7, i32 0
  %10 = extractelement <2 x float> %1, i32 0
  %11 = extractelement <2 x float> %3, i32 0
  %12 = fadd float %10, %11
  %13 = fadd <2 x float> %1, %3
  %14 = insertelement <2 x float> %13, float %12, i32 0
  %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
  %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
  ret { <2 x float>, <2 x float> } %16
}

Technically VectorCombiner could fix a lot of this, but we should work out whats caused it.

Comment 9 Anton Afanasyev 2020-02-21 11:42:20 PST

Hmm, I'm looking into this.

Comment 10 Hans Wennborg 2020-02-25 06:29:24 PST

Can someone bisect to see where the code quality regressed between clang 9 and 10?

Comment 11 Sanjay Patel 2020-02-25 07:22:55 PST

(In reply to Simon Pilgrim from comment #8)
> define dso_local { <2 x float>, <2 x float> } @_Z3Add7Vector4S_(<2 x float>
> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #1 {
>   %5 = extractelement <2 x float> %0, i32 0
>   %6 = extractelement <2 x float> %2, i32 0
>   %7 = fadd float %5, %6
>   %8 = fadd <2 x float> %0, %2
>   %9 = insertelement <2 x float> %8, float %7, i32 0
>   %10 = extractelement <2 x float> %1, i32 0
>   %11 = extractelement <2 x float> %3, i32 0
>   %12 = fadd float %10, %11
>   %13 = fadd <2 x float> %1, %3
>   %14 = insertelement <2 x float> %13, float %12, i32 0
>   %15 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %9, 0
>   %16 = insertvalue { <2 x float>, <2 x float> } %15, <2 x float> %14, 1
>   ret { <2 x float>, <2 x float> } %16
> }
> 
> Technically VectorCombiner could fix a lot of this, but we should work out
> whats caused it.

Similar to bug 45015, we should get the vector math ops back with:
https://reviews.llvm.org/rG10ea01d80d6f
...and the backend can produce the same output as clang9 again on this example and the previous 1:

Vector4 Add(const Vector4 a, const Vector4 b) {
    Vector4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

...but we need to enhance VectorCombine to get ideal IR, and VectorCombine is independent of any regressions between clang9 and clang10.

Comment 12 Anton Afanasyev 2020-02-25 08:07:03 PST

Yes, I've found that this regression is not related to the current PR and to the related commits (D70924, D70587, D72689).

Comment 13 Hans Wennborg 2020-02-26 07:23:25 PST

I don't think we can block the release on this. Unblocking.

Comment 14 David Bolvansky 2021-07-06 07:00:50 PDT

Some patches landed so… fixed?

Comment 15 Simon Pilgrim 2021-07-06 07:08:51 PDT

(In reply to Simon Pilgrim from comment #1)
> Some similar codegen issue for a Matrix22 type: https://godbolt.org/z/gG1_GG

Some of these matrix cases still need attention