typedef char V __attribute__((vector_size(2)));

V foo(V v) {
    v[(V){}[0]] <<= 1;
    return v;
}

Code such as this seems to be very badly optimized for all targets that have SIMD. With -O3, Clang outputs this:

.LCPI0_0:
        .byte   0                               # 0x0
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
foo(char __vector(2)):                  # @foo(char __vector(2))
        movd    xmm0, edi
        movdqa  xmmword ptr [rsp - 24], xmm0
        mov     al, byte ptr [rsp - 24]
        add     al, al
        movzx   eax, al
        movd    xmm1, eax
        pxor    xmm2, xmm2
        punpcklbw       xmm1, xmm2      # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        por     xmm0, xmm1
        movd    eax, xmm0
        ret

GCC outputs this:

foo(char __vector(2)):
        movsx   edx, dil
        mov     eax, edi
        add     edx, edx
        mov     al, dl
        ret
Bitcasting between illegal vectors and scalars...

define i16 @foo(i16 %v.coerce) {
entry:
  %0 = bitcast i16 %v.coerce to <2 x i8>
  %vecext2 = extractelement <2 x i8> %0, i8 0
  %shl = shl i8 %vecext2, 1
  %vecins = insertelement <2 x i8> %0, i8 %shl, i8 0
  %1 = bitcast <2 x i8> %vecins to i16
  ret i16 %1
}

@spatel Could/should vectorcombine catch this?
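For readers following along, here is a rough C sketch of what that IR computes, with the vector ops replaced by plain integer ops. The function names are hypothetical and are not part of LLVM or the original test case. The mask-based form assumes a little-endian target, where element 0 of the <2 x i8> is the low byte of the i16; it is only an illustration of the kind of scalar code we would like to end up with, not the actual transform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference form: mirrors the IR step by step, using the host's byte order. */
static uint16_t foo_via_bytes(uint16_t v) {
    uint8_t elts[2];
    memcpy(elts, &v, sizeof elts);       /* bitcast i16 -> <2 x i8> */
    elts[0] = (uint8_t)(elts[0] << 1);   /* extractelement 0, shl 1, insertelement 0 */
    memcpy(&v, elts, sizeof v);          /* bitcast <2 x i8> -> i16 */
    return v;
}

/* Scalar-only form. Assumption: little-endian, so element 0 is the low byte. */
static uint16_t foo_scalar_le(uint16_t v) {
    /* Keep the high byte, replace the low byte with (low << 1) truncated to 8 bits. */
    return (uint16_t)((v & 0xFF00u) | ((v << 1) & 0x00FEu));
}

/* Small self-check for both forms (hypothetical helper). */
static void check_foo_equivalents(void) {
    uint8_t bytes[2] = {0x81, 0x12};
    uint16_t v;
    memcpy(&v, bytes, sizeof v);
    v = foo_via_bytes(v);
    memcpy(bytes, &v, sizeof bytes);
    assert(bytes[0] == 0x02 && bytes[1] == 0x12); /* only element 0 changed */
    assert(foo_scalar_le(0x1281) == 0x1202);      /* low byte 0x81 -> 0x02 */
}
```

The scalar form needs no vector registers or stack traffic, which is roughly the shape of the GCC output in the description.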
At first look, this seems like we missed some basic patterns in instcombine:
https://alive2.llvm.org/ce/z/abytmj

We don't do that, but we get more complicated cases like:

define i32 @bitcasted_inselt_wide_source_zero_elt(i64 %x) {
; LE-LABEL: @bitcasted_inselt_wide_source_zero_elt(
; LE-NEXT:    [[R:%.*]] = trunc i64 [[X:%.*]] to i32
; LE-NEXT:    ret i32 [[R]]
;
; BE-LABEL: @bitcasted_inselt_wide_source_zero_elt(
; BE-NEXT:    [[TMP1:%.*]] = lshr i64 [[X:%.*]], 32
; BE-NEXT:    [[R:%.*]] = trunc i64 [[TMP1]] to i32
; BE-NEXT:    ret i32 [[R]]
;
  %i = insertelement <2 x i64> zeroinitializer, i64 %x, i32 0
  %b = bitcast <2 x i64> %i to <4 x i32>
  %r = extractelement <4 x i32> %b, i32 0
  ret i32 %r
}
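To spell out why the LE and BE check lines differ, here is a small C sketch of the same extraction. The helper names are hypothetical. It assumes the host is either plain little-endian or plain big-endian: element 0 of the narrower view is the low half of the i64 on little-endian and the high half on big-endian, hence the extra lshr in the BE case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian: lane 0 of the i32 view is the low half, so a trunc suffices. */
static uint32_t elt0_le(uint64_t x) { return (uint32_t)x; }

/* Big-endian: lane 0 is the high half, so shift right by 32, then trunc. */
static uint32_t elt0_be(uint64_t x) { return (uint32_t)(x >> 32); }

/* Reference: read lane 0 through memory, using the host's byte order. */
static uint32_t elt0_host(uint64_t x) {
    uint32_t lanes[2];
    memcpy(lanes, &x, sizeof lanes);
    return lanes[0];
}

/* Detect host byte order by checking which byte of a 16-bit 1 comes first. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Self-check: the memory view must agree with the formula for the host's endianness. */
static void check_elt0(void) {
    const uint64_t x = 0x1122334455667788ULL;
    assert(elt0_le(x) == 0x55667788u);
    assert(elt0_be(x) == 0x11223344u);
    assert(elt0_host(x) == (host_is_little_endian() ? elt0_le(x) : elt0_be(x)));
}
```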
This doesn't help the example in the description (because of multi-use), but we should be able to build up from it: https://reviews.llvm.org/rGdb231ebdb07f
https://reviews.llvm.org/rGd95ebef4b8ec

...makes it less bad, but we still need something like this:
https://alive2.llvm.org/ce/z/9RmeS6
(In reply to Sanjay Patel from comment #4)
> https://reviews.llvm.org/rGd95ebef4b8ec
>
> ...makes it less bad, but we still need something like this:
> https://alive2.llvm.org/ce/z/9RmeS6

Here's the endian-aware version of that proof:
https://alive2.llvm.org/ce/z/Ux-662
This example should be fixed after:
https://reviews.llvm.org/rG80ab06c599a0

But as noted in the commit message, the transform isn't completely general. So if you have other code like this that still has problems, please do file another bug.

There's also a difference in the x86 codegen vs. gcc, so if the motivating program for this example is still slower, that might be another bug.