[X86] Avoid scalar/vector transfers for scalar arithmetic #41978

Closed
RKSimon opened this issue Jul 16, 2019 · 2 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@RKSimon
Collaborator

RKSimon commented Jul 16, 2019

Bugzilla Link: 42633
Resolution: FIXED
Resolved on: Mar 09, 2020 08:58
Version: trunk
OS: Windows NT
CC: @topperc, @RKSimon, @rotateright
Fixed by commit(s): a69158c

Extended Description

https://godbolt.org/z/LvQiCP

#include <x86intrin.h>

auto add(__v4si x) {
  return _mm_set1_epi32(x[1] + x[3]);
}

_Z3addDv4_i:
  vextractps $1, %xmm0, %eax
  vextractps $3, %xmm0, %ecx
  addl %eax, %ecx
  vmovd %ecx, %xmm0
  vpshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
  retq

A better sequence would stay in the vector domain, something like (Intel syntax):

add(int __vector(4)):
  vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
  vpaddd xmm0, xmm1, xmm0
  vpshufd xmm0, xmm0, 85 # xmm0 = xmm0[1,1,1,1]
  ret
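
For reference, the shuffle/add/shuffle sequence above can be sketched at the source level with SSE2 intrinsics. This is only an illustration of the idea, not what the compiler is expected to emit verbatim; the function names `add_scalar`, `add_vector`, `same_result` and `self_check` are hypothetical helpers made up for this sketch.

```cpp
#include <emmintrin.h>  // SSE2: _mm_shuffle_epi32, _mm_add_epi32, _mm_set1_epi32
#include <cassert>

// Scalar-style version, equivalent to the reported function: read lanes 1
// and 3, add them as scalars, then broadcast the sum to all four lanes.
static inline __m128i add_scalar(__m128i x) {
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), x);
    return _mm_set1_epi32(lanes[1] + lanes[3]);
}

// Vector-only version mirroring the vpshufd/vpaddd/vpshufd sequence.
static inline __m128i add_vector(__m128i x) {
    // 78 == 0b01001110 selects lanes [2,3,0,1], i.e. swaps the 64-bit halves.
    __m128i swapped = _mm_shuffle_epi32(x, 78);
    // Lane 1 of the sum is x[3] + x[1]; the other lanes are ignored.
    __m128i sum = _mm_add_epi32(swapped, x);
    // 85 == 0b01010101 broadcasts lane 1 to every lane.
    return _mm_shuffle_epi32(sum, 85);
}

// Hypothetical helper: true when both versions agree in all four lanes.
static inline bool same_result(__m128i x) {
    __m128i eq = _mm_cmpeq_epi32(add_scalar(x), add_vector(x));
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

// Quick self-check on one input: lanes 1 and 3 are 2 and 4, so every lane
// of the result should be 6.
static inline void self_check() {
    __m128i x = _mm_setr_epi32(1, 2, 3, 4);
    assert(same_result(x));
    assert(_mm_cvtsi128_si32(add_vector(x)) == 6);
}
```

The point of the vector-only form is that the value never leaves the XMM register file, so the extract and insert transfers between vector and general-purpose registers disappear.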

@rotateright
Contributor

define <4 x i32> @extract_add_splat(<4 x i32> %x) {
  %e1 = extractelement <4 x i32> %x, i32 1
  %e3 = extractelement <4 x i32> %x, i32 3
  %a = add nsw i32 %e1, %e3
  %i = insertelement <4 x i32> undef, i32 %a, i32 0
  %r = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}

Are we committed to not vectorizing in SDAG? Changing that 'add' to <4 x i32> is an obvious win for any vector target that I can imagine.

We're committed to not vectorizing in InstCombine, so that's out.

I tried to figure out how to cram this into SLP, but I don't see how to do it without adding a big chunk of logic outside of everything that already exists. SLP just doesn't seem amenable to this kind of peephole opt.

Various attempts so far:
https://reviews.llvm.org/D59710
https://reviews.llvm.org/D64142
https://reviews.llvm.org/D66416

Should we add a "VectorCombine" IR pass? I'm imagining InstCombine-like iteration, but only on vector ops and with access to the cost model.

@rotateright
Contributor

We have a vector combine pass now, so we get a vector add:
https://reviews.llvm.org/D75689
https://reviews.llvm.org/rGa69158c12acd

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.