LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 6246 - SLP vectorization when performing scalar operations on vector elements
Summary: SLP vectorization when performing scalar operations on vector elements
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations (show other bugs)
Version: 2.6
Hardware: PC Linux
: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-05 13:46 PST by Henning Thielemann
Modified: 2017-01-08 14:54 PST (History)
7 users (show)

See Also:
Fixed By Commit(s):


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Henning Thielemann 2010-02-05 13:46:17 PST
In my automatically generated code it often happens that scalar operations are applied to vector elements that could have been written as vector operations as well. E.g. (due to modularization issues) I generate code like

define <4 x float> @_vadd(<4 x float>, <4 x float>) {
  %a0 = extractelement <4 x float> %0, i32 0
  %b0 = extractelement <4 x float> %1, i32 0
  %c0 = fadd float %a0, %b0
  %a1 = extractelement <4 x float> %0, i32 1
  %b1 = extractelement <4 x float> %1, i32 1
  %c1 = fadd float %a1, %b1
  %a2 = extractelement <4 x float> %0, i32 2
  %b2 = extractelement <4 x float> %1, i32 2
  %c2 = fadd float %a2, %b2
  %a3 = extractelement <4 x float> %0, i32 3
  %b3 = extractelement <4 x float> %1, i32 3
  %c3 = fadd float %a3, %b3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

I think it would be both correct and more efficient to swap 'fadd's and 'extractelements' by an optimization pass which would yield:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  %c0 = extractelement <4 x float> %c, i32 0
  %c1 = extractelement <4 x float> %c, i32 1
  %c2 = extractelement <4 x float> %c, i32 2
  %c3 = extractelement <4 x float> %c, i32 3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

That the remaining extractelements and insertelements are the identity transform is already correctly detected both by the optimizer and the (X86) code generator. The optimizer transforms the last piece of code to something like:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  ret <4 x float> %c
}
Comment 1 Nadav Rotem 2012-05-26 21:02:42 PDT
I think that this should be handled by the BB vectorizer.
Comment 2 Tobias Grosser 2013-07-18 13:41:05 PDT
Nadav, any plans to handle this in the SLP vectorizer?
Comment 3 Nadav Rotem 2013-07-18 13:47:03 PDT
Yea, it should be easy to do.  We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a bunch of inserts.
Comment 4 Simon Pilgrim 2016-10-13 09:26:18 PDT
Is this actually fixed? A quick ToT of either didn't look any better.
Comment 5 Alexey Bataev 2016-10-13 09:45:36 PDT
Yes, I checked it and it works as required
Comment 6 Simon Pilgrim 2016-11-12 09:43:32 PST
Reopening this. If we look at the equivalent c source then it appears that the vectorizer isn't working for non 128-bit vectors:

#include <x86intrin.h>

__m128 _vadd128(__m128 a, __m128 b) {
 return _mm_setr_ps(
   a[0] + b[0],
   a[1] + b[1],
   a[2] + b[2],
   a[3] + b[3]);
}

__m256 _vadd256(__m256 a, __m256 b) {
 return _mm256_setr_ps(
   a[0] + b[0],
   a[1] + b[1],
   a[2] + b[2],
   a[3] + b[3],
   a[4] + b[4],
   a[5] + b[5],
   a[6] + b[6],
   a[7] + b[7]);
}

clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <4 x float> @_vadd128(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
entry:
  %0 = fadd <4 x float> %a, %b
  ret <4 x float> %0
}

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <8 x float> @_vadd256(<8 x float> %a, <8 x float> %b) local_unnamed_addr #0 {
entry:
  %vecext = extractelement <8 x float> %a, i32 0
  %vecext1 = extractelement <8 x float> %b, i32 0
  %add = fadd float %vecext, %vecext1
  %vecext2 = extractelement <8 x float> %a, i32 1
  %vecext3 = extractelement <8 x float> %b, i32 1
  %add4 = fadd float %vecext2, %vecext3
  %vecext5 = extractelement <8 x float> %a, i32 2
  %vecext6 = extractelement <8 x float> %b, i32 2
  %add7 = fadd float %vecext5, %vecext6
  %vecext8 = extractelement <8 x float> %a, i32 3
  %vecext9 = extractelement <8 x float> %b, i32 3
  %add10 = fadd float %vecext8, %vecext9
  %vecext11 = extractelement <8 x float> %a, i32 4
  %vecext12 = extractelement <8 x float> %b, i32 4
  %add13 = fadd float %vecext11, %vecext12
  %vecext14 = extractelement <8 x float> %a, i32 5
  %vecext15 = extractelement <8 x float> %b, i32 5
  %add16 = fadd float %vecext14, %vecext15
  %vecext17 = extractelement <8 x float> %a, i32 6
  %vecext18 = extractelement <8 x float> %b, i32 6
  %add19 = fadd float %vecext17, %vecext18
  %vecext20 = extractelement <8 x float> %a, i32 7
  %vecext21 = extractelement <8 x float> %b, i32 7
  %add22 = fadd float %vecext20, %vecext21
  %vecinit.i = insertelement <8 x float> undef, float %add, i32 0
  %vecinit1.i = insertelement <8 x float> %vecinit.i, float %add4, i32 1
  %vecinit2.i = insertelement <8 x float> %vecinit1.i, float %add7, i32 2
  %vecinit3.i = insertelement <8 x float> %vecinit2.i, float %add10, i32 3
  %vecinit4.i = insertelement <8 x float> %vecinit3.i, float %add13, i32 4
  %vecinit5.i = insertelement <8 x float> %vecinit4.i, float %add16, i32 5
  %vecinit6.i = insertelement <8 x float> %vecinit5.i, float %add19, i32 6
  %vecinit7.i = insertelement <8 x float> %vecinit6.i, float %add22, i32 7
  ret <8 x float> %vecinit7.i
}
Comment 7 Alexey Bataev 2016-12-01 14:52:21 PST
Must be fixed in r288412
Comment 8 Michael Kuperstein 2016-12-01 16:59:52 PST
r288412 was reverted in r288431, reopening.
Comment 9 Simon Pilgrim 2016-12-02 04:55:52 PST
I've added tests for this (128/256/512 bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492
Comment 10 Alexey Bataev 2016-12-02 09:07:35 PST
Now, I think, it is finally fixed in r288497
Comment 11 Michael Kuperstein 2016-12-02 12:58:16 PST
Reverted again in r288508, reopening.
Comment 12 Simon Pilgrim 2017-01-08 14:54:00 PST
Fixed at rL289043 - it seems to have stuck this time.