https://godbolt.org/z/LvQiCP

#include <x86intrin.h>
auto add(__v4si x) {
  return _mm_set1_epi32(x[1] + x[3]);
}

_Z3addDv4_i:
        vextractps $1, %xmm0, %eax
        vextractps $3, %xmm0, %ecx
        addl    %eax, %ecx
        vmovd   %ecx, %xmm0
        vpshufd $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        retq

A more optimal method would be something like:

add(int __vector(4)):
        vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        vpaddd  xmm0, xmm1, xmm0
        vpshufd xmm0, xmm0, 85          # xmm0 = xmm0[1,1,1,1]
        ret
define <4 x i32> @extract_add_splat(<4 x i32> %x) {
  %e1 = extractelement <4 x i32> %x, i32 1
  %e3 = extractelement <4 x i32> %x, i32 3
  %a = add nsw i32 %e1, %e3
  %i = insertelement <4 x i32> undef, i32 %a, i32 0
  %r = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}

Are we committed to not vectorizing in SDAG? Changing that 'add' to <4 x i32> is an obvious win for any vector target that I can imagine.

We're committed to not vectorizing in InstCombine, so that's out. I tried to figure out how to cram this into SLP, but I don't see how to do it without adding a big chunk of logic outside of everything that already exists. SLP just doesn't seem amenable to this kind of peephole opt. Various attempts so far:
https://reviews.llvm.org/D59710
https://reviews.llvm.org/D64142
https://reviews.llvm.org/D66416

Should we add a "VectorCombine" IR pass? I'm imagining InstCombine-like iteration, but only on vector ops and with access to the cost model.
We have a vector combine pass now, so we get a vector add:
https://reviews.llvm.org/D75689
https://reviews.llvm.org/rGa69158c12acd
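For this example, the vectorized IR looks roughly like the following. This is a hand-written sketch of the intended form (scalar add replaced by a vector add fed by shuffles), not the exact output of the pass:

```llvm
define <4 x i32> @extract_add_splat(<4 x i32> %x) {
  ; Move x[3] into lane 1, then one vector add puts x[1]+x[3] in lane 1.
  %shift = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %a = add nsw <4 x i32> %x, %shift
  ; Splat lane 1 to all lanes.
  %r = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %r
}
```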