52057 – Failure to avoid using SIMD excessively on small vectors

Status:	RESOLVED FIXED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations
Version:	trunk
Hardware:	PC Linux

Importance:	P enhancement
Assignee:	Sanjay Patel

Reported:	2021-10-03 20:09 PDT by Gabriel Ravier
Modified:	2021-10-20 12:44 PDT
CC List:	3 users

Description Gabriel Ravier 2021-10-03 20:09:21 PDT

typedef char V __attribute__((vector_size(2)));

V foo(V v)
{
  v[(V){}[0]] <<= 1;
  return v;
}

Code such as this seems to be very badly optimized for all targets that have SIMD.

With -O3, Clang outputs this:

.LCPI0_0:
        .byte   0                               # 0x0
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
        .byte   255                             # 0xff
foo(char __vector(2)):                            # @foo(char __vector(2))
        movd    xmm0, edi
        movdqa  xmmword ptr [rsp - 24], xmm0
        mov     al, byte ptr [rsp - 24]
        add     al, al
        movzx   eax, al
        movd    xmm1, eax
        pxor    xmm2, xmm2
        punpcklbw       xmm1, xmm2              # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        por     xmm0, xmm1
        movd    eax, xmm0
        ret

GCC outputs this:

foo(char __vector(2)):
        movsx   edx, dil
        mov     eax, edi
        add     edx, edx
        mov     al, dl
        ret
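
For reference, the operation in foo can be written directly on the 16-bit scalar that carries V. This is a minimal sketch, assuming a little-endian target (so element 0 is the low byte); foo_scalar is a hypothetical name for illustration and does not come from Clang or GCC.

```c
#include <stdint.h>

/* Hypothetical scalar equivalent of foo(), assuming little-endian lane order:
   element 0 of the two-byte vector is the low byte of the 16-bit value. */
uint16_t foo_scalar(uint16_t v)
{
    uint8_t lane0 = (uint8_t)(v << 1);         /* v[0] <<= 1, truncated to 8 bits */
    return (uint16_t)((v & 0xFF00u) | lane0);  /* keep element 1, replace element 0 */
}
```

The GCC output above has roughly this shape: double the low byte and merge it back into the incoming register, with no vector instructions involved.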

Comment 1 Simon Pilgrim 2021-10-04 01:34:53 PDT

Bitcasting between illegal vectors and scalars...

define i16 @foo(i16 %v.coerce) {
entry:
  %0 = bitcast i16 %v.coerce to <2 x i8>
  %vecext2 = extractelement <2 x i8> %0, i8 0
  %shl = shl i8 %vecext2, 1
  %vecins = insertelement <2 x i8> %0, i8 %shl, i8 0
  %1 = bitcast <2 x i8> %vecins to i16
  ret i16 %1
}


@spatel Could/should vectorcombine catch this?

Comment 2 Sanjay Patel 2021-10-04 05:04:54 PDT

At first look, this seems like we missed some basic patterns in instcombine:
https://alive2.llvm.org/ce/z/abytmj

We don't do that, but we do handle more complicated cases like:

define i32 @bitcasted_inselt_wide_source_zero_elt(i64 %x) {
; LE-LABEL: @bitcasted_inselt_wide_source_zero_elt(
; LE-NEXT:    [[R:%.*]] = trunc i64 [[X:%.*]] to i32
; LE-NEXT:    ret i32 [[R]]
;
; BE-LABEL: @bitcasted_inselt_wide_source_zero_elt(
; BE-NEXT:    [[TMP1:%.*]] = lshr i64 [[X:%.*]], 32
; BE-NEXT:    [[R:%.*]] = trunc i64 [[TMP1]] to i32
; BE-NEXT:    ret i32 [[R]]
;
  %i = insertelement <2 x i64> zeroinitializer, i64 %x, i32 0
  %b = bitcast <2 x i64> %i to <4 x i32>
  %r = extractelement <4 x i32> %b, i32 0
  ret i32 %r
}
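
The LE and BE check lines differ because a bitcast reinterprets the same bytes, and the byte order decides which half of %x lands in element 0 of the <4 x i32> view. The following is a rough sketch that models this with an explicit byte buffer; extract_lane0_after_bitcast and its big_endian flag are hypothetical names for illustration, not LLVM APIs.

```c
#include <stdint.h>

/* Models the byte layout of <2 x i64> with x in element 0 and zero in
   element 1, then reads element 0 of the same 16 bytes viewed as <4 x i32>.
   big_endian selects the byte order used for both the store and the load. */
uint32_t extract_lane0_after_bitcast(uint64_t x, int big_endian)
{
    uint8_t bytes[16] = {0};

    /* Store x into bytes 0..7 in the chosen byte order. */
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        bytes[i] = (uint8_t)(x >> shift);
    }

    /* Load the first 32-bit element (bytes 0..3) in the same byte order. */
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        r |= (uint32_t)bytes[i] << shift;
    }
    return r;
}
```

With the little-endian layout the result is the low 32 bits of x (a plain trunc); with the big-endian layout it is the high 32 bits (lshr by 32, then trunc), matching the two sets of check lines.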

Comment 3 Sanjay Patel 2021-10-06 06:46:28 PDT

This doesn't help the example in the description (because of multi-use), but we should be able to build up from it:
https://reviews.llvm.org/rGdb231ebdb07f

Comment 4 Sanjay Patel 2021-10-07 12:22:30 PDT

https://reviews.llvm.org/rGd95ebef4b8ec

...makes it less bad, but we still need something like this:
https://alive2.llvm.org/ce/z/9RmeS6

Comment 5 Sanjay Patel 2021-10-20 11:22:26 PDT

(In reply to Sanjay Patel from comment #4)
> https://reviews.llvm.org/rGd95ebef4b8ec
> 
> ...makes it less bad, but we still need something like this:
> https://alive2.llvm.org/ce/z/9RmeS6

Here's the endian-aware version of that proof:
https://alive2.llvm.org/ce/z/Ux-662
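
The linked proofs are not reproduced here. As a rough illustration of why such a fold has to be endian-aware, assume it rewrites an element insert on a bitcast integer into mask, shift, and or on the integer itself; the shift amount for a given lane then depends on the byte order. insert_lane_u8 is a hypothetical helper written for this sketch, not code from the patch.

```c
#include <stdint.h>

/* Hypothetical helper: replace 8-bit lane idx (0 or 1) of a 16-bit value
   that is viewed as a two-element byte vector. Lane 0 is the low byte on
   little-endian targets and the high byte on big-endian targets. */
uint16_t insert_lane_u8(uint16_t x, uint8_t y, unsigned idx, int big_endian)
{
    unsigned shift = big_endian ? 8u * (1u - idx) : 8u * idx;
    uint16_t mask = (uint16_t)(0xFFu << shift);
    return (uint16_t)((x & (uint16_t)~mask) | ((uint16_t)y << shift));
}
```

For example, inserting 0xAB into lane 0 of 0x1234 gives 0x12AB under the little-endian mapping and 0xAB34 under the big-endian one.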

Comment 6 Sanjay Patel 2021-10-20 12:44:03 PDT

This example should be fixed after:
https://reviews.llvm.org/rG80ab06c599a0

But as noted in the commit message, the transform isn't completely general. So if you have other code like this that still has problems, please do file another bug.

There's also a difference in the x86 codegen vs. gcc, so if the motivating program for this example is still slower, that might be another bug.