Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLP vectorization when performing scalar operations on vector elements #6618

Closed
thielema mannequin opened this issue Feb 5, 2010 · 12 comments
Closed

SLP vectorization when performing scalar operations on vector elements #6618

thielema mannequin opened this issue Feb 5, 2010 · 12 comments
Labels
bugzilla Issues migrated from bugzilla

Comments

@thielema
Copy link
Mannequin

thielema mannequin commented Feb 5, 2010

Bugzilla Link 6246
Resolution FIXED
Resolved on Jan 08, 2017 14:54
Version 2.6
OS Linux
CC @alexey-bataev,@RKSimon,@rotateright,@tobiasgrosser

Extended Description

In my automatically generated code it often happens that scalar operations are applied to vector elements that could have been written as vector operations as well. E.g. (due to modularization issues) I generate code like

define <4 x float> @​_vadd(<4 x float>, <4 x float>) {
%a0 = extractelement <4 x float> %0, i32 0
%b0 = extractelement <4 x float> %1, i32 0
%c0 = fadd float %a0, %b0
%a1 = extractelement <4 x float> %0, i32 1
%b1 = extractelement <4 x float> %1, i32 1
%c1 = fadd float %a1, %b1
%a2 = extractelement <4 x float> %0, i32 2
%b2 = extractelement <4 x float> %1, i32 2
%c2 = fadd float %a2, %b2
%a3 = extractelement <4 x float> %0, i32 3
%b3 = extractelement <4 x float> %1, i32 3
%c3 = fadd float %a3, %b3
%d0 = insertelement <4 x float> undef, float %c0, i32 0
%d1 = insertelement <4 x float> %d0, float %c1, i32 1
%d2 = insertelement <4 x float> %d1, float %c2, i32 2
%d3 = insertelement <4 x float> %d2, float %c3, i32 3
ret <4 x float> %d3
}

I think it would be both correct and more efficient to swap 'fadd's and 'extractelements' by an optimization pass which would yield:

define <4 x float> @​_vadd(<4 x float>, <4 x float>) nounwind readnone {
%c = fadd <4 x float> %0, %1
%c0 = extractelement <4 x float> %c, i32 0
%c1 = extractelement <4 x float> %c, i32 1
%c2 = extractelement <4 x float> %c, i32 2
%c3 = extractelement <4 x float> %c, i32 3
%d0 = insertelement <4 x float> undef, float %c0, i32 0
%d1 = insertelement <4 x float> %d0, float %c1, i32 1
%d2 = insertelement <4 x float> %d1, float %c2, i32 2
%d3 = insertelement <4 x float> %d2, float %c3, i32 3
ret <4 x float> %d3
}

That the remaining extractelements and insertelements are the identity transform is already correctly detected both by the optimizer and the (X86) code generator. The optimizer transforms the last piece of code to something like:

define <4 x float> @​_vadd(<4 x float>, <4 x float>) nounwind readnone {
%c = fadd <4 x float> %0, %1
ret <4 x float> %c
}

@llvmbot
Copy link
Collaborator

llvmbot commented May 27, 2012

I think that this should be handled by the BB vectorizer.

@tobiasgrosser
Copy link
Contributor

Nadav, any plans to handle this in the SLP vectorizer?

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 18, 2013

Yea, it should be easy to do. We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a bunch of inserts.

@RKSimon
Copy link
Collaborator

RKSimon commented Oct 13, 2016

Is this actually fixed? A quick ToT of either didn't look any better.

@alexey-bataev
Copy link
Member

Yes, I checked it and it works as required

@RKSimon
Copy link
Collaborator

RKSimon commented Nov 12, 2016

Reopening this. If we look at the equivalent c source then it appears that the vectorizer isn't working for non 128-bit vectors:

#include <x86intrin.h>

__m128 _vadd128(__m128 a, __m128 b) {
return _mm_setr_ps(
a[0] + b[0],
a[1] + b[1],
a[2] + b[2],
a[3] + b[3]);
}

__m256 _vadd256(__m256 a, __m256 b) {
return _mm256_setr_ps(
a[0] + b[0],
a[1] + b[1],
a[2] + b[2],
a[3] + b[3],
a[4] + b[4],
a[5] + b[5],
a[6] + b[6],
a[7] + b[7]);
}

clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <4 x float> @​_vadd128(<4 x float> %a, <4 x float> %b) local_unnamed_addr #​0 {
entry:
%0 = fadd <4 x float> %a, %b
ret <4 x float> %0
}

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <8 x float> @​_vadd256(<8 x float> %a, <8 x float> %b) local_unnamed_addr #​0 {
entry:
%vecext = extractelement <8 x float> %a, i32 0
%vecext1 = extractelement <8 x float> %b, i32 0
%add = fadd float %vecext, %vecext1
%vecext2 = extractelement <8 x float> %a, i32 1
%vecext3 = extractelement <8 x float> %b, i32 1
%add4 = fadd float %vecext2, %vecext3
%vecext5 = extractelement <8 x float> %a, i32 2
%vecext6 = extractelement <8 x float> %b, i32 2
%add7 = fadd float %vecext5, %vecext6
%vecext8 = extractelement <8 x float> %a, i32 3
%vecext9 = extractelement <8 x float> %b, i32 3
%add10 = fadd float %vecext8, %vecext9
%vecext11 = extractelement <8 x float> %a, i32 4
%vecext12 = extractelement <8 x float> %b, i32 4
%add13 = fadd float %vecext11, %vecext12
%vecext14 = extractelement <8 x float> %a, i32 5
%vecext15 = extractelement <8 x float> %b, i32 5
%add16 = fadd float %vecext14, %vecext15
%vecext17 = extractelement <8 x float> %a, i32 6
%vecext18 = extractelement <8 x float> %b, i32 6
%add19 = fadd float %vecext17, %vecext18
%vecext20 = extractelement <8 x float> %a, i32 7
%vecext21 = extractelement <8 x float> %b, i32 7
%add22 = fadd float %vecext20, %vecext21
%vecinit.i = insertelement <8 x float> undef, float %add, i32 0
%vecinit1.i = insertelement <8 x float> %vecinit.i, float %add4, i32 1
%vecinit2.i = insertelement <8 x float> %vecinit1.i, float %add7, i32 2
%vecinit3.i = insertelement <8 x float> %vecinit2.i, float %add10, i32 3
%vecinit4.i = insertelement <8 x float> %vecinit3.i, float %add13, i32 4
%vecinit5.i = insertelement <8 x float> %vecinit4.i, float %add16, i32 5
%vecinit6.i = insertelement <8 x float> %vecinit5.i, float %add19, i32 6
%vecinit7.i = insertelement <8 x float> %vecinit6.i, float %add22, i32 7
ret <8 x float> %vecinit7.i
}

@alexey-bataev
Copy link
Member

Must be fixed in r288412

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 2, 2016

r288412 was reverted in r288431, reopening.

@RKSimon
Copy link
Collaborator

RKSimon commented Dec 2, 2016

I've added tests for this (128/256/512 bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492

@alexey-bataev
Copy link
Member

Now, I think, it is finally fixed in r288497

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 2, 2016

Reverted again in r288508, reopening.

@RKSimon
Copy link
Collaborator

RKSimon commented Jan 8, 2017

Fixed at rL289043 - it seems to have stuck this time.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021
Michael137 pushed a commit to Michael137/llvm-project that referenced this issue Apr 8, 2023
[5.9] Cherry-pick missing cas-related commits from `stable/20221013`
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

4 participants