49081 – [x86] Failure to optimize vector shuffle in conversion

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Status:	CONFIRMED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86
Version:	trunk
Hardware:	PC Linux

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

Depends on:	35732

Reported:	2021-02-07 13:07 PST by Gabriel Ravier
Modified:	2021-05-25 05:53 PDT
CC List:	7 users

Description Gabriel Ravier 2021-02-07 13:07:55 PST

typedef int v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

v4sf f(v4si f)
{
    return (v4sf){(float)f[1], (float)f[1], (float)f[2], (float)f[3]};
}

With -O3, GCC outputs this:

f(int __vector(4)):
  pshufd xmm0, xmm0, 229
  cvtdq2ps xmm0, xmm0
  ret

LLVM outputs this:

f(int __vector(4)):
  pshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
  cvtdq2ps xmm1, xmm1
  pshufd xmm2, xmm0, 238 # xmm2 = xmm0[2,3,2,3]
  cvtdq2ps xmm2, xmm2
  pshufd xmm0, xmm0, 255 # xmm0 = xmm0[3,3,3,3]
  cvtdq2ps xmm0, xmm0
  unpcklps xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
  shufps xmm1, xmm2, 64 # xmm1 = xmm1[0,0],xmm2[0,1]
  movaps xmm0, xmm1
  ret
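
For readers comparing the two listings, the pshufd immediates can be decoded by hand: each 2-bit field of the immediate, starting from the low bits, picks the source lane for one result lane. The sketch below is only an illustration of that arithmetic; the function names `decode_pshufd_imm` and `print_pshufd_imm` are hypothetical helpers and not part of GCC or LLVM.

```c
#include <assert.h>
#include <stdio.h>

/* Decode a pshufd immediate: result lane i takes source lane
   (imm >> (2 * i)) & 3. Hypothetical helper for illustration only. */
void decode_pshufd_imm(unsigned imm, int lanes[4])
{
    assert(imm <= 0xFFu); /* the immediate is a single byte */
    for (int i = 0; i < 4; ++i)
        lanes[i] = (int)((imm >> (2 * i)) & 3u);
}

/* Print the selection in the same style as the asm comments above. */
void print_pshufd_imm(unsigned imm)
{
    int lanes[4];
    decode_pshufd_imm(imm, lanes);
    printf("pshufd imm %3u selects [%d,%d,%d,%d]\n",
           imm, lanes[0], lanes[1], lanes[2], lanes[3]);
}
```

Decoding 229 (0xE5) this way gives [1,1,2,3], which is exactly the lane order the source asks for, so GCC needs one shuffle and one cvtdq2ps. The LLVM immediates 85, 238 and 255 decode to [1,1,1,1], [2,3,2,3] and [3,3,3,3], matching the comments in its listing.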

Comment 1 Simon Pilgrim 2021-02-09 06:44:57 PST

Current Codegen: https://godbolt.org/z/oYn1df

Probably an SLP issue? [Bug #35732] looks very similar

Comment 2 Anton Afanasyev 2021-02-22 03:43:47 PST

This issue should be fixed by http://reviews.llvm.org/D57059 once it is committed (it is under review now). Only three elements are converted here (f[1], f[2] and f[3]), i.e. three floats, so I checked that the patch "Initial support for the vectorization of the non-power-of-2 vectors" applies to this case:

> ./opt -slp-vectorizer -instcombine -S pr49081.ll 
...
define dso_local <4 x float> @foo(<4 x i32> %0) {
  %shuffle = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 undef>
  %2 = sitofp <4 x i32> %shuffle to <4 x float>
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 2>
  ret <4 x float> %3
}
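
As a sanity check on the IR above, the shuffle/convert/shuffle sequence can be modeled in plain scalar C and compared with the original function and with the single-shuffle form GCC emits. This is a rough sketch under stated assumptions: the function names are hypothetical, arrays stand in for vectors, and the undef lane of the first shuffle is arbitrarily modeled as lane 0 (it is never read by the second shuffle).

```c
#include <assert.h>

/* Reference: the element-wise form from the bug description. */
void f_scalar(const int in[4], float out[4])
{
    out[0] = (float)in[1];
    out[1] = (float)in[1];
    out[2] = (float)in[2];
    out[3] = (float)in[3];
}

/* Model of the IR after SLP: shuffle <1,2,3,undef>, sitofp,
   then shuffle <0,0,1,2>. */
void f_slp(const int in[4], float out[4])
{
    static const int pre[4] = {1, 2, 3, 0};  /* undef lane modeled as 0 (assumption) */
    static const int post[4] = {0, 0, 1, 2};
    float conv[4];
    for (int i = 0; i < 4; ++i)
        conv[i] = (float)in[pre[i]];
    for (int i = 0; i < 4; ++i)
        out[i] = conv[post[i]];
}

/* Model of GCC's output: one shuffle <1,1,2,3> (pshufd 229),
   then one conversion (cvtdq2ps). */
void f_single_shuffle(const int in[4], float out[4])
{
    static const int mask[4] = {1, 1, 2, 3};
    for (int i = 0; i < 4; ++i)
        out[i] = (float)in[mask[i]];
}

/* Assert that all three forms agree for a given input. */
void check_equivalent(const int in[4])
{
    float a[4], b[4], c[4];
    f_scalar(in, a);
    f_slp(in, b);
    f_single_shuffle(in, c);
    for (int i = 0; i < 4; ++i) {
        assert(a[i] == b[i]);
        assert(a[i] == c[i]);
    }
}
```

Because the conversion is applied lane by lane, composing the two shuffle masks (<1,2,3,undef> followed by <0,0,1,2>) yields <1,1,2,3>, which is the single shuffle in GCC's output.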

Comment 3 Anton Afanasyev 2021-02-22 22:44:38 PST

Committed llvm/test/Transforms/SLPVectorizer/X86/pr49081.ll to track the fix.

Comment 4 Sanjay Patel 2021-05-25 05:53:17 PDT

(In reply to Anton Afanasyev from comment #2)
> This issue is to be fixed by http://reviews.llvm.org/D57059 after committing
> (in process of review now). We have conversion for f[1], f[2] and f[3] here,
> for three fps, so I have checked patch "Initial support for the
> vectorization of the non-power-of-2 vectors" fits here:
> 
> > ./opt -slp-vectorizer -instcombine -S pr49081.ll 
> ...
> define dso_local <4 x float> @foo(<4 x i32> %0) {
>   %shuffle = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> <i32 1,
> i32 2, i32 3, i32 undef>
>   %2 = sitofp <4 x i32> %shuffle to <4 x float>
>   %3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 0,
> i32 0, i32 1, i32 2>
>   ret <4 x float> %3
> }

For reference, that sequence was not being optimized in IR or in the backend, so I added an instcombine transform to make it easier for SDAG:
https://reviews.llvm.org/rG0bab0f616119