https://godbolt.org/z/LvQiCP

#include <x86intrin.h>
auto add(__v4si x) {
  return _mm_set1_epi32(x[1] + x[3]);
}

_Z3addDv4_i:
        vextractps $1, %xmm0, %eax
        vextractps $3, %xmm0, %ecx
        addl    %eax, %ecx
        vmovd   %ecx, %xmm0
        vpshufd $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        retq

A more optimal method would be something like:

add(int __vector(4)):
        vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        vpaddd  xmm0, xmm1, xmm0
        vpshufd xmm0, xmm0, 85          # xmm0 = xmm0[1,1,1,1]
        ret
define <4 x i32> @extract_add_splat(<4 x i32> %x) {
  %e1 = extractelement <4 x i32> %x, i32 1
  %e3 = extractelement <4 x i32> %x, i32 3
  %a = add nsw i32 %e1, %e3
  %i = insertelement <4 x i32> undef, i32 %a, i32 0
  %r = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}

Are we committed to not vectorizing in SDAG? Changing that 'add' to <4 x i32> is an obvious win for any vector target that I can imagine.

We're committed to not vectorizing in InstCombine, so that's out. I tried to figure out how to cram this into SLP, but I don't see how to do it without adding a big chunk of logic outside of everything that already exists. SLP just doesn't seem amenable to this kind of peephole opt. Various attempts so far:
https://reviews.llvm.org/D59710
https://reviews.llvm.org/D64142
https://reviews.llvm.org/D66416

Should we add a "VectorCombine" IR pass? I'm imagining InstCombine-like iteration, but only on vector ops and with access to the cost model.
We have a vector combine pass now, so we get a vector add:
https://reviews.llvm.org/D75689
https://reviews.llvm.org/rGa69158c12acd
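For this example, the vectorized IR looks roughly like the following. This is a hand-written sketch of the intended form (scalar add replaced by a vector add fed by shuffles), not the exact output of the pass:

```llvm
define <4 x i32> @extract_add_splat(<4 x i32> %x) {
  ; Move x[3] into lane 1, then one vector add puts x[1]+x[3] in lane 1.
  %shift = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %a = add nsw <4 x i32> %x, %shift
  ; Splat lane 1 to all lanes.
  %r = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %r
}
```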