SLP vectorization when performing scalar operations on vector elements #6618

thielema · 2010-02-05T21:46:17Z


Bugzilla Link	6246
Resolution	FIXED
Resolved on	Jan 08, 2017 14:54
Version	2.6
OS	Linux
CC	@alexey-bataev,@RKSimon,@rotateright,@tobiasgrosser

Extended Description

In my automatically generated code, scalar operations are often applied to vector elements where a single vector operation would do. For example, due to modularization issues, I generate code like:

define <4 x float> @_vadd(<4 x float>, <4 x float>) {
  %a0 = extractelement <4 x float> %0, i32 0
  %b0 = extractelement <4 x float> %1, i32 0
  %c0 = fadd float %a0, %b0
  %a1 = extractelement <4 x float> %0, i32 1
  %b1 = extractelement <4 x float> %1, i32 1
  %c1 = fadd float %a1, %b1
  %a2 = extractelement <4 x float> %0, i32 2
  %b2 = extractelement <4 x float> %1, i32 2
  %c2 = fadd float %a2, %b2
  %a3 = extractelement <4 x float> %0, i32 3
  %b3 = extractelement <4 x float> %1, i32 3
  %c3 = fadd float %a3, %b3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

I think it would be both correct and more efficient for an optimization pass to swap the 'fadd's and 'extractelement's, which would yield:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  %c0 = extractelement <4 x float> %c, i32 0
  %c1 = extractelement <4 x float> %c, i32 1
  %c2 = extractelement <4 x float> %c, i32 2
  %c3 = extractelement <4 x float> %c, i32 3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

Both the optimizer and the (X86) code generator already detect that the remaining extractelements and insertelements form an identity transform. The optimizer turns the last piece of code into something like:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  ret <4 x float> %c
}
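
The equivalence claimed above can be checked at the C level. The following is a minimal sketch, assuming a compiler that supports the GCC/Clang `vector_size` extension (not ISO C); the type `v4f` and the function names are hypothetical and do not come from this report. It compares the lane-by-lane form against a single whole-vector add.

```c
#include <string.h>

/* GCC/Clang vector extension: four packed floats (an assumption; not ISO C). */
typedef float v4f __attribute__((vector_size(16)));

/* Scalar form: read each lane, add, and write the lane back. */
v4f vadd_lanewise(v4f a, v4f b) {
    v4f d;
    for (int i = 0; i < 4; ++i)
        d[i] = a[i] + b[i];
    return d;
}

/* Vector form: a single whole-vector add. */
v4f vadd_vector(v4f a, v4f b) {
    return a + b;
}

/* Returns 1 if both forms yield bit-identical results for the given inputs. */
int vadd_forms_agree(v4f a, v4f b) {
    v4f x = vadd_lanewise(a, b);
    v4f y = vadd_vector(a, b);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Each lane goes through the same float addition in both forms, so on typical targets `vadd_forms_agree` should return 1 for ordinary inputs. Whether the compiler emits one vector add for `vadd_lanewise` is the question this report is about.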

llvmbot · 2012-05-27T04:02:42Z

I think that this should be handled by the BB vectorizer.

tobiasgrosser · 2013-07-18T20:41:05Z

Nadav, any plans to handle this in the SLP vectorizer?

llvmbot · 2013-07-18T20:47:03Z

Yeah, it should be easy to do. We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a group of inserts.
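
The proposed rule can be illustrated with a toy model. The sketch below is not the SLP vectorizer's code: `struct inst`, `enum opcode`, `MAX_LANES`, and `collect_insert_seed` are hypothetical names invented for illustration. It walks a chain of inserts back to its undef start and accepts it as a seed only if every lane is written exactly once and all inserted scalars share one opcode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of an instruction; not LLVM's data structures. */
enum opcode { OP_UNDEF, OP_FADD, OP_FMUL, OP_INSERT };

struct inst {
    enum opcode op;
    const struct inst *vec;    /* OP_INSERT: vector being inserted into */
    const struct inst *scalar; /* OP_INSERT: scalar being inserted */
    unsigned lane;             /* OP_INSERT: destination lane */
};

#define MAX_LANES 16

/*
 * Walk an insert chain from its last insert back to the undef start.
 * Returns true, with scalars[0..lanes-1] filled in, if every lane is
 * written exactly once and all inserted scalars share one opcode.
 * On failure the contents of scalars are unspecified.
 */
bool collect_insert_seed(const struct inst *last, unsigned lanes,
                         const struct inst *scalars[]) {
    if (lanes == 0 || lanes > MAX_LANES)
        return false;

    bool seen[MAX_LANES] = { false };
    unsigned count = 0;
    const struct inst *cur = last;

    while (cur && cur->op == OP_INSERT) {
        if (cur->lane >= lanes || seen[cur->lane] || !cur->scalar)
            return false;
        seen[cur->lane] = true;
        scalars[cur->lane] = cur->scalar;
        ++count;
        cur = cur->vec;
    }

    /* The chain must start at undef and cover every lane. */
    if (!cur || cur->op != OP_UNDEF || count != lanes)
        return false;

    /* All lanes must be produced by the same kind of operation. */
    for (unsigned i = 1; i < lanes; ++i)
        if (scalars[i]->op != scalars[0]->op)
            return false;
    return true;
}
```

In this model, the four-insert chain from the original report would be accepted with the four `fadd` results as the bundle, while a chain that mixes opcodes or leaves a lane unwritten would be rejected.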

RKSimon · 2016-10-13T16:26:18Z

Is this actually fixed? A quick ToT test of either didn't look any better.

alexey-bataev · 2016-10-13T16:45:36Z

Yes, I checked it and it works as required.

RKSimon · 2016-11-12T17:43:32Z

Reopening this. Looking at the equivalent C source, it appears that the vectorizer isn't working for vectors that are not 128 bits wide:

#include <x86intrin.h>

__m128 _vadd128(__m128 a, __m128 b) {
  return _mm_setr_ps(
      a[0] + b[0],
      a[1] + b[1],
      a[2] + b[2],
      a[3] + b[3]);
}

__m256 _vadd256(__m256 a, __m256 b) {
  return _mm256_setr_ps(
      a[0] + b[0],
      a[1] + b[1],
      a[2] + b[2],
      a[3] + b[3],
      a[4] + b[4],
      a[5] + b[5],
      a[6] + b[6],
      a[7] + b[7]);
}

clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <4 x float> @_vadd128(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
entry:
  %0 = fadd <4 x float> %a, %b
  ret <4 x float> %0
}

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <8 x float> @_vadd256(<8 x float> %a, <8 x float> %b) local_unnamed_addr #0 {
entry:
  %vecext = extractelement <8 x float> %a, i32 0
  %vecext1 = extractelement <8 x float> %b, i32 0
  %add = fadd float %vecext, %vecext1
  %vecext2 = extractelement <8 x float> %a, i32 1
  %vecext3 = extractelement <8 x float> %b, i32 1
  %add4 = fadd float %vecext2, %vecext3
  %vecext5 = extractelement <8 x float> %a, i32 2
  %vecext6 = extractelement <8 x float> %b, i32 2
  %add7 = fadd float %vecext5, %vecext6
  %vecext8 = extractelement <8 x float> %a, i32 3
  %vecext9 = extractelement <8 x float> %b, i32 3
  %add10 = fadd float %vecext8, %vecext9
  %vecext11 = extractelement <8 x float> %a, i32 4
  %vecext12 = extractelement <8 x float> %b, i32 4
  %add13 = fadd float %vecext11, %vecext12
  %vecext14 = extractelement <8 x float> %a, i32 5
  %vecext15 = extractelement <8 x float> %b, i32 5
  %add16 = fadd float %vecext14, %vecext15
  %vecext17 = extractelement <8 x float> %a, i32 6
  %vecext18 = extractelement <8 x float> %b, i32 6
  %add19 = fadd float %vecext17, %vecext18
  %vecext20 = extractelement <8 x float> %a, i32 7
  %vecext21 = extractelement <8 x float> %b, i32 7
  %add22 = fadd float %vecext20, %vecext21
  %vecinit.i = insertelement <8 x float> undef, float %add, i32 0
  %vecinit1.i = insertelement <8 x float> %vecinit.i, float %add4, i32 1
  %vecinit2.i = insertelement <8 x float> %vecinit1.i, float %add7, i32 2
  %vecinit3.i = insertelement <8 x float> %vecinit2.i, float %add10, i32 3
  %vecinit4.i = insertelement <8 x float> %vecinit3.i, float %add13, i32 4
  %vecinit5.i = insertelement <8 x float> %vecinit4.i, float %add16, i32 5
  %vecinit6.i = insertelement <8 x float> %vecinit5.i, float %add19, i32 6
  %vecinit7.i = insertelement <8 x float> %vecinit6.i, float %add22, i32 7
  ret <8 x float> %vecinit7.i
}

alexey-bataev · 2016-12-01T22:52:21Z

Should be fixed in r288412.

llvmbot · 2016-12-02T00:59:52Z

r288412 was reverted in r288431, reopening.

RKSimon · 2016-12-02T12:55:52Z

I've added tests for this (128/256/512-bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492.

alexey-bataev · 2016-12-02T17:07:35Z

I think it is now finally fixed in r288497.

llvmbot · 2016-12-02T20:58:16Z

Reverted again in r288508, reopening.

RKSimon · 2017-01-08T22:54:00Z

Fixed at rL289043 - it seems to have stuck this time.


llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021

This issue was closed.
